Part 1. Data Preprocessing and Feature Engineering¶
The first phase of this project is dedicated to rigorous data preprocessing and feature engineering, which are critical steps to ensure the development of a robust and reliable credit risk prediction model. This stage involves transforming raw data into a clean and structured format, ready for effective modeling. The dataset used in this project originates from Lending Club’s loan data, covering a wide range of borrower and loan attributes collected between 2007 and 2018.
The data processing workflow begins with importing and inspecting the dataset to identify quality issues, inconsistencies, and missing values. An initial analysis of missing data patterns is performed, and features with extremely high proportions of missing values—deemed to contribute little or no value to modeling—are systematically dropped. For remaining variables with manageable levels of missingness, appropriate imputation strategies are applied to preserve the integrity of the dataset.
A crucial early step is the construction of the target variable, which represents the credit outcome (e.g., default or non-default) and is derived from relevant loan status fields. Once defined, the dataset is split into training and testing sets to ensure that model evaluation is performed on unseen data, thereby promoting generalizability and reducing overfitting.
Next, categorical variables are identified and encoded. Discrete features with meaningful class distributions are transformed using categorical encoding techniques that retain interpretability and predictive power. Simultaneously, continuous numerical variables are examined for multicollinearity using Variance Inflation Factor (VIF) analysis. Highly collinear features are removed to reduce redundancy and prevent unstable coefficient estimation in downstream modeling.
Following this, continuous variables are categorized using Weight of Evidence (WoE) binning—a technique that aligns with the monotonic relationship between predictors and the binary target variable. This also enables the calculation of Information Value (IV), which helps assess each feature’s predictive strength.
The final part of preprocessing includes extensive feature engineering, which involves constructing new variables, aggregating existing information, and deriving ratios or interaction terms that better capture the financial behavior of borrowers. These transformations are guided by both domain knowledge and data-driven insights.
After completing these steps, the resulting dataset comprises 344 curated features, which form the foundation for building and training the credit risk model in the subsequent stages of the project.
I. General Data Preparation¶
Import Libraries¶
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt
import seaborn as sns
Dataset Description¶
The dataset used in this project contains detailed loan-level data from Lending Club, a prominent U.S. peer-to-peer lending platform. It covers accepted loan applications from 2007 to the fourth quarter of 2018, totaling more than 1 million individual loans (🔗 https://www.kaggle.com/datasets/thegurus/loan-data-accepted).
This dataset was obtained from Kaggle, and it includes a wide range of borrower and loan characteristics, such as:
- Applicant financial data: income, employment length, debt-to-income ratio, etc.
- Loan terms: loan amount, interest rate, installment, loan purpose.
- Credit history: delinquency counts, public records, revolving balance, credit age.
- Performance data: loan status, payment history, amount repaid, outstanding principal, and more.
To align with a real-world modeling scenario, we divided the dataset into two subsets:
A training set consisting of loan applications up to a given point in time, used for developing the Expected Loss (EL) components—Probability of Default (PD), Loss Given Default (LGD), and Exposure at Default (EAD).
A test set containing applications submitted after the PD model was trained, used to evaluate how well the model generalizes to new data.
This approach enables us to test the temporal robustness of the PD model and assess whether newer applicants exhibit similar characteristics to those in the historical training data.
# Import data (accepted and rejected demands).
loan_data_accepted = pd.read_csv('C:/Disc D/365DataScience/Credit risk modeling/Self_project/Data2/accepted_2007_to_2018Q4.csv',low_memory=False)
loan_data_rejected = pd.read_csv('C:/Disc D/365DataScience/Credit risk modeling/Self_project/Data2/rejected_2007_to_2018Q4.csv',low_memory=False)
# Copy the dataframe.
loan_data_accepted = loan_data_accepted.copy()
loan_data_rejected = loan_data_rejected.copy()
# Expose the first 5 rows of the accepted loans.
pd.options.display.max_columns = None
loan_data_accepted.head()
| id | member_id | loan_amnt | funded_amnt | funded_amnt_inv | term | int_rate | installment | grade | sub_grade | emp_title | emp_length | home_ownership | annual_inc | verification_status | issue_d | loan_status | pymnt_plan | url | desc | purpose | title | zip_code | addr_state | dti | delinq_2yrs | earliest_cr_line | fico_range_low | fico_range_high | inq_last_6mths | mths_since_last_delinq | mths_since_last_record | open_acc | pub_rec | revol_bal | revol_util | total_acc | initial_list_status | out_prncp | out_prncp_inv | total_pymnt | total_pymnt_inv | total_rec_prncp | total_rec_int | total_rec_late_fee | recoveries | collection_recovery_fee | last_pymnt_d | last_pymnt_amnt | next_pymnt_d | last_credit_pull_d | last_fico_range_high | last_fico_range_low | collections_12_mths_ex_med | mths_since_last_major_derog | policy_code | application_type | annual_inc_joint | dti_joint | verification_status_joint | acc_now_delinq | tot_coll_amt | tot_cur_bal | open_acc_6m | open_act_il | open_il_12m | open_il_24m | mths_since_rcnt_il | total_bal_il | il_util | open_rv_12m | open_rv_24m | max_bal_bc | all_util | total_rev_hi_lim | inq_fi | total_cu_tl | inq_last_12m | acc_open_past_24mths | avg_cur_bal | bc_open_to_buy | bc_util | chargeoff_within_12_mths | delinq_amnt | mo_sin_old_il_acct | mo_sin_old_rev_tl_op | mo_sin_rcnt_rev_tl_op | mo_sin_rcnt_tl | mort_acc | mths_since_recent_bc | mths_since_recent_bc_dlq | mths_since_recent_inq | mths_since_recent_revol_delinq | num_accts_ever_120_pd | num_actv_bc_tl | num_actv_rev_tl | num_bc_sats | num_bc_tl | num_il_tl | num_op_rev_tl | num_rev_accts | num_rev_tl_bal_gt_0 | num_sats | num_tl_120dpd_2m | num_tl_30dpd | num_tl_90g_dpd_24m | num_tl_op_past_12m | pct_tl_nvr_dlq | percent_bc_gt_75 | pub_rec_bankruptcies | tax_liens | tot_hi_cred_lim | total_bal_ex_mort | total_bc_limit | total_il_high_credit_limit | revol_bal_joint | sec_app_fico_range_low | sec_app_fico_range_high | sec_app_earliest_cr_line | sec_app_inq_last_6mths | sec_app_mort_acc | sec_app_open_acc | sec_app_revol_util | sec_app_open_act_il | sec_app_num_rev_accts | sec_app_chargeoff_within_12_mths | sec_app_collections_12_mths_ex_med | sec_app_mths_since_last_major_derog | hardship_flag | hardship_type | hardship_reason | hardship_status | deferral_term | hardship_amount | hardship_start_date | hardship_end_date | payment_plan_start_date | hardship_length | hardship_dpd | hardship_loan_status | orig_projected_additional_accrued_interest | hardship_payoff_balance_amount | hardship_last_payment_amount | disbursement_method | debt_settlement_flag | debt_settlement_flag_date | settlement_status | settlement_date | settlement_amount | settlement_percentage | settlement_term | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 68407277 | NaN | 3600.0 | 3600.0 | 3600.0 | 36 months | 13.99 | 123.03 | C | C4 | leadman | 10+ years | MORTGAGE | 55000.0 | Not Verified | Dec-2015 | Fully Paid | n | https://lendingclub.com/browse/loanDetail.acti... | NaN | debt_consolidation | Debt consolidation | 190xx | PA | 5.91 | 0.0 | Aug-2003 | 675.0 | 679.0 | 1.0 | 30.0 | NaN | 7.0 | 0.0 | 2765.0 | 29.7 | 13.0 | w | 0.00 | 0.00 | 4421.723917 | 4421.72 | 3600.00 | 821.72 | 0.0 | 0.0 | 0.0 | Jan-2019 | 122.67 | NaN | Mar-2019 | 564.0 | 560.0 | 0.0 | 30.0 | 1.0 | Individual | NaN | NaN | NaN | 0.0 | 722.0 | 144904.0 | 2.0 | 2.0 | 0.0 | 1.0 | 21.0 | 4981.0 | 36.0 | 3.0 | 3.0 | 722.0 | 34.0 | 9300.0 | 3.0 | 1.0 | 4.0 | 4.0 | 20701.0 | 1506.0 | 37.2 | 0.0 | 0.0 | 148.0 | 128.0 | 3.0 | 3.0 | 1.0 | 4.0 | 69.0 | 4.0 | 69.0 | 2.0 | 2.0 | 4.0 | 2.0 | 5.0 | 3.0 | 4.0 | 9.0 | 4.0 | 7.0 | 0.0 | 0.0 | 0.0 | 3.0 | 76.9 | 0.0 | 0.0 | 0.0 | 178050.0 | 7746.0 | 2400.0 | 13734.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | N | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Cash | N | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 68355089 | NaN | 24700.0 | 24700.0 | 24700.0 | 36 months | 11.99 | 820.28 | C | C1 | Engineer | 10+ years | MORTGAGE | 65000.0 | Not Verified | Dec-2015 | Fully Paid | n | https://lendingclub.com/browse/loanDetail.acti... | NaN | small_business | Business | 577xx | SD | 16.06 | 1.0 | Dec-1999 | 715.0 | 719.0 | 4.0 | 6.0 | NaN | 22.0 | 0.0 | 21470.0 | 19.2 | 38.0 | w | 0.00 | 0.00 | 25679.660000 | 25679.66 | 24700.00 | 979.66 | 0.0 | 0.0 | 0.0 | Jun-2016 | 926.35 | NaN | Mar-2019 | 699.0 | 695.0 | 0.0 | NaN | 1.0 | Individual | NaN | NaN | NaN | 0.0 | 0.0 | 204396.0 | 1.0 | 1.0 | 0.0 | 1.0 | 19.0 | 18005.0 | 73.0 | 2.0 | 3.0 | 6472.0 | 29.0 | 111800.0 | 0.0 | 0.0 | 6.0 | 4.0 | 9733.0 | 57830.0 | 27.1 | 0.0 | 0.0 | 113.0 | 192.0 | 2.0 | 2.0 | 4.0 | 2.0 | NaN | 0.0 | 6.0 | 0.0 | 5.0 | 5.0 | 13.0 | 17.0 | 6.0 | 20.0 | 27.0 | 5.0 | 22.0 | 0.0 | 0.0 | 0.0 | 2.0 | 97.4 | 7.7 | 0.0 | 0.0 | 314017.0 | 39475.0 | 79300.0 | 24667.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | N | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Cash | N | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 68341763 | NaN | 20000.0 | 20000.0 | 20000.0 | 60 months | 10.78 | 432.66 | B | B4 | truck driver | 10+ years | MORTGAGE | 63000.0 | Not Verified | Dec-2015 | Fully Paid | n | https://lendingclub.com/browse/loanDetail.acti... | NaN | home_improvement | NaN | 605xx | IL | 10.78 | 0.0 | Aug-2000 | 695.0 | 699.0 | 0.0 | NaN | NaN | 6.0 | 0.0 | 7869.0 | 56.2 | 18.0 | w | 0.00 | 0.00 | 22705.924294 | 22705.92 | 20000.00 | 2705.92 | 0.0 | 0.0 | 0.0 | Jun-2017 | 15813.30 | NaN | Mar-2019 | 704.0 | 700.0 | 0.0 | NaN | 1.0 | Joint App | 71000.0 | 13.85 | Not Verified | 0.0 | 0.0 | 189699.0 | 0.0 | 1.0 | 0.0 | 4.0 | 19.0 | 10827.0 | 73.0 | 0.0 | 2.0 | 2081.0 | 65.0 | 14000.0 | 2.0 | 5.0 | 1.0 | 6.0 | 31617.0 | 2737.0 | 55.9 | 0.0 | 0.0 | 125.0 | 184.0 | 14.0 | 14.0 | 5.0 | 101.0 | NaN | 10.0 | NaN | 0.0 | 2.0 | 3.0 | 2.0 | 4.0 | 6.0 | 4.0 | 7.0 | 3.0 | 6.0 | 0.0 | 0.0 | 0.0 | 0.0 | 100.0 | 50.0 | 0.0 | 0.0 | 218418.0 | 18696.0 | 6200.0 | 14877.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | N | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Cash | N | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 66310712 | NaN | 35000.0 | 35000.0 | 35000.0 | 60 months | 14.85 | 829.90 | C | C5 | Information Systems Officer | 10+ years | MORTGAGE | 110000.0 | Source Verified | Dec-2015 | Current | n | https://lendingclub.com/browse/loanDetail.acti... | NaN | debt_consolidation | Debt consolidation | 076xx | NJ | 17.06 | 0.0 | Sep-2008 | 785.0 | 789.0 | 0.0 | NaN | NaN | 13.0 | 0.0 | 7802.0 | 11.6 | 17.0 | w | 15897.65 | 15897.65 | 31464.010000 | 31464.01 | 19102.35 | 12361.66 | 0.0 | 0.0 | 0.0 | Feb-2019 | 829.90 | Apr-2019 | Mar-2019 | 679.0 | 675.0 | 0.0 | NaN | 1.0 | Individual | NaN | NaN | NaN | 0.0 | 0.0 | 301500.0 | 1.0 | 1.0 | 0.0 | 1.0 | 23.0 | 12609.0 | 70.0 | 1.0 | 1.0 | 6987.0 | 45.0 | 67300.0 | 0.0 | 1.0 | 0.0 | 2.0 | 23192.0 | 54962.0 | 12.1 | 0.0 | 0.0 | 36.0 | 87.0 | 2.0 | 2.0 | 1.0 | 2.0 | NaN | NaN | NaN | 0.0 | 4.0 | 5.0 | 8.0 | 10.0 | 2.0 | 10.0 | 13.0 | 5.0 | 13.0 | 0.0 | 0.0 | 0.0 | 1.0 | 100.0 | 0.0 | 0.0 | 0.0 | 381215.0 | 52226.0 | 62500.0 | 18000.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | N | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Cash | N | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 68476807 | NaN | 10400.0 | 10400.0 | 10400.0 | 60 months | 22.45 | 289.91 | F | F1 | Contract Specialist | 3 years | MORTGAGE | 104433.0 | Source Verified | Dec-2015 | Fully Paid | n | https://lendingclub.com/browse/loanDetail.acti... | NaN | major_purchase | Major purchase | 174xx | PA | 25.37 | 1.0 | Jun-1998 | 695.0 | 699.0 | 3.0 | 12.0 | NaN | 12.0 | 0.0 | 21929.0 | 64.5 | 35.0 | w | 0.00 | 0.00 | 11740.500000 | 11740.50 | 10400.00 | 1340.50 | 0.0 | 0.0 | 0.0 | Jul-2016 | 10128.96 | NaN | Mar-2018 | 704.0 | 700.0 | 0.0 | NaN | 1.0 | Individual | NaN | NaN | NaN | 0.0 | 0.0 | 331730.0 | 1.0 | 3.0 | 0.0 | 3.0 | 14.0 | 73839.0 | 84.0 | 4.0 | 7.0 | 9702.0 | 78.0 | 34000.0 | 2.0 | 1.0 | 3.0 | 10.0 | 27644.0 | 4567.0 | 77.5 | 0.0 | 0.0 | 128.0 | 210.0 | 4.0 | 4.0 | 6.0 | 4.0 | 12.0 | 1.0 | 12.0 | 0.0 | 4.0 | 6.0 | 5.0 | 9.0 | 10.0 | 7.0 | 19.0 | 6.0 | 12.0 | 0.0 | 0.0 | 0.0 | 4.0 | 96.6 | 60.0 | 0.0 | 0.0 | 439570.0 | 95768.0 | 20300.0 | 88097.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | N | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Cash | N | NaN | NaN | NaN | NaN | NaN | NaN |
# Expose the first 5 rows of the rejected loans.
loan_data_rejected.sample(5)
| Amount Requested | Application Date | Loan Title | Risk_Score | Debt-To-Income Ratio | Zip Code | State | Employment Length | Policy Code | |
|---|---|---|---|---|---|---|---|---|---|
| 8910244 | 15000.0 | 2014-08-21 | debt_consolidation | 603.0 | 32.33% | 471xx | IN | < 1 year | 0.0 |
| 12365756 | 25000.0 | 2017-08-06 | Credit card refinancing | 622.0 | 46.58% | 713xx | LA | < 1 year | 0.0 |
| 18011080 | 25000.0 | 2018-05-01 | Debt consolidation | NaN | 54.75% | 322xx | FL | < 1 year | 0.0 |
| 11160554 | 15000.0 | 2018-03-13 | Debt consolidation | NaN | 11.5% | 112xx | NY | < 1 year | 0.0 |
| 17248026 | 2500.0 | 2018-12-31 | Car financing | NaN | 0.54% | 453xx | OH | 1 year | 0.0 |
loan_data_accepted.shape
(2260701, 151)
Explore Data¶
loan_data = loan_data_accepted.copy()
pd.options.display.max_columns = None
#pd.options.display.max_rows = None
# Sets the pandas dataframe options to display all columns/ rows.
loan_data.columns.values
# Displays all column names.
array(['id', 'member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv',
'term', 'int_rate', 'installment', 'grade', 'sub_grade',
'emp_title', 'emp_length', 'home_ownership', 'annual_inc',
'verification_status', 'issue_d', 'loan_status', 'pymnt_plan',
'url', 'desc', 'purpose', 'title', 'zip_code', 'addr_state', 'dti',
'delinq_2yrs', 'earliest_cr_line', 'fico_range_low',
'fico_range_high', 'inq_last_6mths', 'mths_since_last_delinq',
'mths_since_last_record', 'open_acc', 'pub_rec', 'revol_bal',
'revol_util', 'total_acc', 'initial_list_status', 'out_prncp',
'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv',
'total_rec_prncp', 'total_rec_int', 'total_rec_late_fee',
'recoveries', 'collection_recovery_fee', 'last_pymnt_d',
'last_pymnt_amnt', 'next_pymnt_d', 'last_credit_pull_d',
'last_fico_range_high', 'last_fico_range_low',
'collections_12_mths_ex_med', 'mths_since_last_major_derog',
'policy_code', 'application_type', 'annual_inc_joint', 'dti_joint',
'verification_status_joint', 'acc_now_delinq', 'tot_coll_amt',
'tot_cur_bal', 'open_acc_6m', 'open_act_il', 'open_il_12m',
'open_il_24m', 'mths_since_rcnt_il', 'total_bal_il', 'il_util',
'open_rv_12m', 'open_rv_24m', 'max_bal_bc', 'all_util',
'total_rev_hi_lim', 'inq_fi', 'total_cu_tl', 'inq_last_12m',
'acc_open_past_24mths', 'avg_cur_bal', 'bc_open_to_buy', 'bc_util',
'chargeoff_within_12_mths', 'delinq_amnt', 'mo_sin_old_il_acct',
'mo_sin_old_rev_tl_op', 'mo_sin_rcnt_rev_tl_op', 'mo_sin_rcnt_tl',
'mort_acc', 'mths_since_recent_bc', 'mths_since_recent_bc_dlq',
'mths_since_recent_inq', 'mths_since_recent_revol_delinq',
'num_accts_ever_120_pd', 'num_actv_bc_tl', 'num_actv_rev_tl',
'num_bc_sats', 'num_bc_tl', 'num_il_tl', 'num_op_rev_tl',
'num_rev_accts', 'num_rev_tl_bal_gt_0', 'num_sats',
'num_tl_120dpd_2m', 'num_tl_30dpd', 'num_tl_90g_dpd_24m',
'num_tl_op_past_12m', 'pct_tl_nvr_dlq', 'percent_bc_gt_75',
'pub_rec_bankruptcies', 'tax_liens', 'tot_hi_cred_lim',
'total_bal_ex_mort', 'total_bc_limit',
'total_il_high_credit_limit', 'revol_bal_joint',
'sec_app_fico_range_low', 'sec_app_fico_range_high',
'sec_app_earliest_cr_line', 'sec_app_inq_last_6mths',
'sec_app_mort_acc', 'sec_app_open_acc', 'sec_app_revol_util',
'sec_app_open_act_il', 'sec_app_num_rev_accts',
'sec_app_chargeoff_within_12_mths',
'sec_app_collections_12_mths_ex_med',
'sec_app_mths_since_last_major_derog', 'hardship_flag',
'hardship_type', 'hardship_reason', 'hardship_status',
'deferral_term', 'hardship_amount', 'hardship_start_date',
'hardship_end_date', 'payment_plan_start_date', 'hardship_length',
'hardship_dpd', 'hardship_loan_status',
'orig_projected_additional_accrued_interest',
'hardship_payoff_balance_amount', 'hardship_last_payment_amount',
'disbursement_method', 'debt_settlement_flag',
'debt_settlement_flag_date', 'settlement_status',
'settlement_date', 'settlement_amount', 'settlement_percentage',
'settlement_term'], dtype=object)
loan_data.info()
# Displays column names, complete (non-missing) cases per column, and datatype per column.
<class 'pandas.core.frame.DataFrame'> RangeIndex: 2260701 entries, 0 to 2260700 Columns: 151 entries, id to settlement_term dtypes: float64(113), object(38) memory usage: 2.5+ GB
General Preprocessing¶
Preprocessing few continuous variables¶
Variable 'emp_length'¶
loan_data['emp_length'].unique()
# Displays unique values of a column.
array(['10+ years', '3 years', '4 years', '6 years', '1 year', '7 years',
'8 years', '5 years', '2 years', '9 years', '< 1 year', nan],
dtype=object)
loan_data['emp_length_int'] = loan_data['emp_length'].str.replace('\\+ years', '', regex=True)
loan_data['emp_length_int'] = loan_data['emp_length_int'].str.replace('< 1 year', str(0))
loan_data['emp_length_int'] = loan_data['emp_length_int'].str.replace('n/a', str(0))
loan_data['emp_length_int'] = loan_data['emp_length_int'].str.replace(' years', '')
loan_data['emp_length_int'] = loan_data['emp_length_int'].str.replace(' year', '')
# We store the preprocessed ‘employment length’ variable in a new variable called ‘employment length int’,
# We assign the new ‘employment length int’ to be equal to the ‘employment length’ variable with the string ‘+ years’
# replaced with nothing. Next, we replace the whole string ‘less than 1 year’ with the string ‘0’.
# Then, we replace the ‘n/a’ string with the string ‘0’. Then, we replace the string ‘space years’ with nothing.
# Finally, we replace the string ‘space year’ with nothing.
type(loan_data['emp_length_int'][0])
# Checks the datatype of a single element of a column.
str
loan_data['emp_length_int'].value_counts()
emp_length_int 10 748005 2 203677 0 189988 3 180753 1 148403 5 139698 4 136605 6 102628 7 92695 8 91914 9 79395 Name: count, dtype: int64
# Replace "Unknown" with NaN and then convert to numeric
loan_data['emp_length_int'] = loan_data['emp_length_int'].replace('Unknown', np.nan)
# Transforms the values to numeric.
loan_data['emp_length_int'] = pd.to_numeric(loan_data['emp_length_int'])
Fill with the mode (most frequent)¶
loan_data = loan_data.copy()
# Mode imputation
loan_data['emp_length_int'] = loan_data['emp_length_int'].fillna(loan_data['emp_length_int'].mode()[0])
loan_data['emp_length_int'].value_counts()
emp_length_int 10.0 894945 2.0 203677 0.0 189988 3.0 180753 1.0 148403 5.0 139698 4.0 136605 6.0 102628 7.0 92695 8.0 91914 9.0 79395 Name: count, dtype: int64
# Checking the percentage of the missing values for each category
loan_data['emp_length_int'].isnull().mean()
0.0
Convert string variable to date and time¶
loan_data = loan_data.drop(columns = ['emp_length'])
Variable 'term'¶
loan_data['term'].describe()
# Shows some descriptive statisics for the values of a column.
count 2260668 unique 2 top 36 months freq 1609754 Name: term, dtype: object
loan_data['term_int'] = loan_data['term'].str.replace(' months', '')
# We replace a string with another string, in this case, with an empty strng (i.e. with nothing).
loan_data['term_int'].sample(5)
1790274 60 1752897 36 1594219 36 175338 60 1387009 60 Name: term_int, dtype: object
type(loan_data['term_int'][25])
# Checks the datatype of a single element of a column.
str
loan_data['term_int'] = pd.to_numeric(loan_data['term'].str.replace(' months', ''))
# We remplace a string from a variable with another string, in this case, with an empty string (i.e. with nothing).
# We turn the result to numeric datatype and save it in another variable.
loan_data['term_int'].sample(5)
1682884 36.0 575675 36.0 1187779 36.0 821812 36.0 1579849 36.0 Name: term_int, dtype: float64
type(loan_data['term_int'][0])
# Checks the datatype of a single element of a column.
numpy.float64
Variable 'issue_d'¶
loan_data['issue_d'].sample(5)
1662795 Mar-2017 1818823 Jul-2013 270902 May-2015 342562 Mar-2015 1418687 Nov-2018 Name: issue_d, dtype: object
# Assume we are now in December 2020
loan_data['issue_d_date'] = pd.to_datetime(loan_data['issue_d'], format='mixed', errors='coerce')
# Extracts the date and the time from a string variable that is in a given format.
loan_data['mths_since_issue_d'] = round(pd.to_numeric((pd.to_datetime('2020-12-01') - loan_data['issue_d_date']) / np.timedelta64(30, 'D')))
# We calculate the difference between two dates in months, turn it to numeric datatype and round it.
# We save the result in a new variable.
loan_data['mths_since_issue_d'].describe()
# Shows some descriptive statisics for the values of a column.
count 2.260668e+06 mean 5.581604e+01 std 2.189573e+01 min 2.400000e+01 25% 3.800000e+01 50% 5.400000e+01 75% 6.900000e+01 max 1.640000e+02 Name: mths_since_issue_d, dtype: float64
Check for missing values and clean¶
# Checking the percentage of the missing values for each category
missing_percent = loan_data.isnull().mean() * 100
missing_percent = missing_percent[missing_percent >= 0].sort_values(ascending=False)
# Set option to display all rows
pd.set_option('display.max_rows', None)
# Display the result
print(missing_percent)
member_id 100.000000 orig_projected_additional_accrued_interest 99.617331 hardship_loan_status 99.517097 hardship_last_payment_amount 99.517097 deferral_term 99.517097 hardship_status 99.517097 hardship_reason 99.517097 hardship_type 99.517097 hardship_end_date 99.517097 payment_plan_start_date 99.517097 hardship_length 99.517097 hardship_dpd 99.517097 hardship_amount 99.517097 hardship_payoff_balance_amount 99.517097 hardship_start_date 99.517097 settlement_date 98.485160 debt_settlement_flag_date 98.485160 settlement_term 98.485160 settlement_status 98.485160 settlement_percentage 98.485160 settlement_amount 98.485160 sec_app_mths_since_last_major_derog 98.410139 sec_app_revol_util 95.303050 revol_bal_joint 95.221836 sec_app_mort_acc 95.221792 sec_app_fico_range_low 95.221792 sec_app_chargeoff_within_12_mths 95.221792 sec_app_collections_12_mths_ex_med 95.221792 sec_app_inq_last_6mths 95.221792 sec_app_num_rev_accts 95.221792 sec_app_open_act_il 95.221792 sec_app_open_acc 95.221792 sec_app_earliest_cr_line 95.221792 sec_app_fico_range_high 95.221792 verification_status_joint 94.880791 dti_joint 94.660683 annual_inc_joint 94.660506 desc 94.423632 mths_since_last_record 84.113069 mths_since_recent_bc_dlq 77.011511 mths_since_last_major_derog 74.309960 mths_since_recent_revol_delinq 67.250910 next_pymnt_d 59.509993 mths_since_last_delinq 51.246715 il_util 47.281042 mths_since_rcnt_il 40.251099 all_util 38.323555 total_cu_tl 38.313912 open_acc_6m 38.313912 inq_last_12m 38.313912 open_rv_12m 38.313868 open_rv_24m 38.313868 total_bal_il 38.313868 max_bal_bc 38.313868 open_il_24m 38.313868 open_il_12m 38.313868 inq_fi 38.313868 open_act_il 38.313868 mths_since_recent_inq 13.069751 emp_title 7.387178 num_tl_120dpd_2m 6.798334 mo_sin_old_il_acct 6.153136 bc_util 3.366389 percent_bc_gt_75 3.335779 bc_open_to_buy 3.316140 mths_since_recent_bc 3.248771 pct_tl_nvr_dlq 3.116909 avg_cur_bal 3.113149 mo_sin_rcnt_rev_tl_op 3.110097 num_rev_accts 3.110097 mo_sin_old_rev_tl_op 3.110097 num_actv_bc_tl 3.110053 mo_sin_rcnt_tl 3.110053 num_actv_rev_tl 3.110053 num_accts_ever_120_pd 3.110053 total_il_high_credit_limit 3.110053 num_il_tl 3.110053 num_bc_tl 3.110053 total_rev_hi_lim 3.110053 tot_hi_cred_lim 3.110053 num_tl_op_past_12m 3.110053 num_tl_90g_dpd_24m 3.110053 num_tl_30dpd 3.110053 num_rev_tl_bal_gt_0 3.110053 num_op_rev_tl 3.110053 tot_cur_bal 3.110053 tot_coll_amt 3.110053 num_bc_sats 2.593134 num_sats 2.593134 total_bc_limit 2.214490 total_bal_ex_mort 2.214490 acc_open_past_24mths 2.214490 mort_acc 2.214490 title 1.033264 last_pymnt_d 0.108816 revol_util 0.081170 dti 0.077144 pub_rec_bankruptcies 0.061839 collections_12_mths_ex_med 0.007874 chargeoff_within_12_mths 0.007874 tax_liens 0.006104 last_credit_pull_d 0.004645 inq_last_6mths 0.002787 earliest_cr_line 0.002743 pub_rec 0.002743 open_acc 0.002743 delinq_2yrs 0.002743 acc_now_delinq 0.002743 delinq_amnt 0.002743 total_acc 0.002743 annual_inc 0.001637 zip_code 0.001504 disbursement_method 0.001460 hardship_flag 0.001460 debt_settlement_flag 0.001460 term_int 0.001460 issue_d_date 0.001460 mths_since_issue_d 0.001460 application_type 0.001460 sub_grade 0.001460 url 0.001460 pymnt_plan 0.001460 loan_status 0.001460 issue_d 0.001460 verification_status 0.001460 home_ownership 0.001460 grade 0.001460 policy_code 0.001460 installment 0.001460 int_rate 0.001460 term 0.001460 funded_amnt_inv 0.001460 funded_amnt 0.001460 loan_amnt 0.001460 purpose 0.001460 addr_state 0.001460 fico_range_low 0.001460 fico_range_high 0.001460 last_fico_range_low 0.001460 last_fico_range_high 0.001460 last_pymnt_amnt 0.001460 collection_recovery_fee 0.001460 recoveries 0.001460 total_rec_late_fee 0.001460 total_rec_int 0.001460 total_rec_prncp 0.001460 total_pymnt_inv 0.001460 total_pymnt 0.001460 out_prncp_inv 0.001460 out_prncp 0.001460 initial_list_status 0.001460 revol_bal 0.001460 emp_length_int 0.000000 id 0.000000 dtype: float64
The following features with missing values > 90% can be dropped:¶
- member_id: Completely missing (100%) — useless for modeling
- orig_projected_additional_accrued_interest: Nearly all missing (~99.6%), irrelevant to creditworthiness
- hardship_... features: ~99.5% missing, very sparse; relevant only to specialized hardship analysis
- settlement_... features: ~98.5% missing, specific to post-default negotiation — not useful for default prediction before loan approval
- sec_app_... features: ~95% missing, refer to secondary applicants — only useful for joint applications, which are rare
- revol_bal_joint: ~95.2% missing, same reason as above
- verification_status_joint, dti_joint, annual_inc_joint: ~94.6–94.8% missing, tied to joint applications — can be dropped for general credit risk modeling
- desc: (94.4%) very sparse and unstructured (free text); not useful unless we plan to do NLP
Why it makes sense to drop them:
In credit risk modeling, especially when we are focused on individual (non-joint) loans, these features:
- Won’t contribute meaningfully to predictive power
- Might introduce noise or overfitting due to their sparsity
- Can increase memory and computation time unnecessarily
features_to_drop = [
'member_id', 'orig_projected_additional_accrued_interest',
'hardship_end_date', 'deferral_term', 'hardship_status',
'hardship_reason', 'hardship_type', 'hardship_payoff_balance_amount',
'hardship_last_payment_amount', 'payment_plan_start_date',
'hardship_amount', 'hardship_loan_status', 'hardship_start_date',
'hardship_dpd', 'hardship_length', 'debt_settlement_flag_date',
'settlement_date', 'settlement_amount', 'settlement_percentage',
'settlement_term', 'settlement_status', 'sec_app_mths_since_last_major_derog',
'sec_app_revol_util', 'revol_bal_joint', 'sec_app_inq_last_6mths',
'sec_app_num_rev_accts', 'sec_app_open_act_il', 'sec_app_open_acc',
'sec_app_mort_acc', 'sec_app_chargeoff_within_12_mths',
'sec_app_collections_12_mths_ex_med', 'sec_app_fico_range_low',
'sec_app_earliest_cr_line', 'sec_app_fico_range_high',
'verification_status_joint', 'dti_joint', 'annual_inc_joint', 'desc'
]
# Drop the features with very high missing values (>90%)
loan_data = loan_data.drop(columns=features_to_drop)
Considering the following three features:¶
- mths_since_last_record (84.1%): Credit delay history — could be useful, but very sparse
- mths_since_recent_bc_dlq (77.0%): Credit delay (bank card delinquency) — consider keeping if strongly predictive
- mths_since_last_major_derog (74.3%): Major derogatory marks — credit-relevant, but high sparsity
Recommendation:
- Run a correlation or feature importance test (like Random Forest feature importance).
- If any of them has low predictive power, it will be droped.
- If you're prioritizing simplicity and generalizability, it’s okay to drop all three.
Evaluation of the predictive power of the three delay-related features using a Random Forest classifier and feature importance
# Step 1: Choose the relevant features and the target
delay_features = [
'mths_since_last_record',
'mths_since_recent_bc_dlq',
'mths_since_last_major_derog'
]
# Example: Binary target variable preparation (adjust based on your dataset)
# Replace with appropriate mapping depending on your version of 'loan_status'
loan_data['target'] = loan_data['loan_status'].apply(lambda x: 1 if x in ['Charged Off', 'Default'] else 0)
# Step 2: Subset the data and drop rows with missing values in selected columns
df = loan_data[delay_features + ['target']].dropna()
# Step 3: Split into train/test
X = df[delay_features]
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Step 4: Train a Random Forest model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
# Step 5: Feature importance visualization
importances = model.feature_importances_
feat_importance = pd.Series(importances, index=delay_features).sort_values(ascending=True)
print(importances)
plt.figure(figsize=(8, 4))
sns.barplot(x=feat_importance, y=feat_importance.index, palette='viridis')
plt.title("Feature Importance (Random Forest)")
plt.xlabel("Importance Score")
plt.ylabel("Delay Feature")
plt.tight_layout()
plt.show()
[0.40604894 0.29841302 0.29553803]
C:\Users\pc\AppData\Local\Temp\ipykernel_6728\2205266688.py:38: FutureWarning: Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `y` variable to `hue` and set `legend=False` for the same effect. sns.barplot(x=feat_importance, y=feat_importance.index, palette='viridis')
Interpretation:¶
- mths_since_last_record contributes most to predicting credit risk among the three, capturing about 40.6% of the predictive power in your mini-model.
- The other two features are almost equally important and still carry decent predictive power (~29.8% and ~29.6%).
Since all three features have reasonable importance, we choose to Keep all of them as we are optimizing for model performance and don’t mind some sparsity due to missing values (or plan to impute them).
Fill Missing values of these 3 features:¶
These columns represent number of months since a certain delinquency event, so missing values usually mean "the event never occurred".
- Best Practice: Impute with a large number (e.g., 999) to indicate “never had this event”.
- This maintains the numeric nature of the variable and distinguishes between recent vs never.
# List of the features
delay_features = ['mths_since_last_record', 'mths_since_recent_bc_dlq', 'mths_since_last_major_derog']
# Fill missing values with 999
loan_data[delay_features] = loan_data[delay_features].fillna(999)
Features with moderate missingness (around 38–67%)¶
For these features, the imputation strategy depends on the data type and business logic of each feature. Since our project focuses on credit risk modeling, we’ll treat these with care.
- Date-related / "Months since": Fill with a high number (e.g., 999) to indicate "No activity" or "Never delinquent".
- Count-type features: Fill with 0 (meaning "no record", e.g., 0 inquiries).
- Ratio/Utilization (%): Fill with median or domain-specific value (e.g., 0 or 100%).
Recommended Fill Strategies:
# Fill 'months since' features with high value to indicate 'no delinquency'
loan_data['mths_since_recent_revol_delinq'] = loan_data['mths_since_recent_revol_delinq'].fillna(999)
loan_data['mths_since_last_delinq'] = loan_data['mths_since_last_delinq'].fillna(999)
loan_data['mths_since_rcnt_il'] = loan_data['mths_since_rcnt_il'].fillna(999)
# Date - next scheduled payment, could be missing because loan is fully paid off
loan_data['next_pymnt_d'] = loan_data['next_pymnt_d'].fillna('No Payment Due') # or use pd.NaT if you prefer datetime
# Utilization ratios - fill with median or 0
loan_data['il_util'] = loan_data['il_util'].fillna(loan_data['il_util'].median())
loan_data['all_util'] = loan_data['all_util'].fillna(loan_data['all_util'].median())
# Counts / Frequency - fill with 0 (no activity)
count_cols = [
'total_cu_tl', 'open_acc_6m', 'inq_last_12m', 'total_bal_il', 'max_bal_bc',
'open_il_12m', 'open_act_il', 'inq_fi', 'open_rv_12m', 'open_rv_24m', 'open_il_24m'
]
loan_data[count_cols] = loan_data[count_cols].fillna(0)
Features with missing values ranging from ~1% to ~13%¶
For the features listed above (with missing values ranging from ~1% to ~13%), here's a tailored imputation strategy that balances practicality, model performance, and data integrity for your credit risk modeling project.
Grouping by Feature Type and Imputation Strategy:
Delinquency / Inquiries / Behavior:
- mths_since_recent_inq : (Fill with 999) ; Indicates no inquiries (consistent with other "mths_since" logic).
- num_tl_120dpd_2m : (Fill with 0) ; No delinquency.
Account age / timelines:
- mo_sin_old_il_acct, mo_sin_old_rev_tl_op, mo_sin_rcnt_rev_tl_op, mo_sin_rcnt_tl : (Fill with median) ; Age-based numeric values.
Utilization/ratio:
- bc_util, percent_bc_gt_75, pct_tl_nvr_dlq, all_util : (Fill with median or domain-specific values (e.g., 0)) ; Percent values.
Balances / credit limits:
- bc_open_to_buy, avg_cur_bal, total_rev_hi_lim, tot_cur_bal, total_il_high_credit_limit, tot_hi_cred_lim, total_bc_limit, total_bal_ex_mort : (Fill with median) ; Dollar amounts; median avoids skew.
Count features (accounts, inquiries): (Fill with 0) ; No accounts or events (safe assumption).
- Features: num_rev_accts, num_accts_ever_120_pd, num_actv_bc_tl, num_actv_rev_tl, num_rev_tl_bal_gt_0, num_tl_90g_dpd_24m, num_tl_30dpd, num_tl_op_past_12m, num_op_rev_tl, num_il_tl, num_bc_tl, num_bc_sats, num_sats, mort_acc, acc_open_past_24mths, tot_coll_amt
Loan metadata (text): 'title'; Fill with "Unknown" or drop; Optional unstructured text.
mths_since_recent_bc: (3.25% missing)
- Meaning: Number of months since the borrower's most recent bankcard account opened.
- Type: Numeric, continuous (likely integer).
- Strategy: Use median imputation — this is robust to outliers and appropriate for time-based features.
- Employment-related:
- emp_title : ("Unknown" or "Other") ; Free text, no standard format. Optionally drop or encode later.
loan_data = loan_data.drop(columns=['emp_title'])
# Months since events
loan_data['mths_since_recent_inq'] = loan_data['mths_since_recent_inq'].fillna(999)
# Utilization/ratio: median
ratio_cols = ['bc_util', 'percent_bc_gt_75', 'pct_tl_nvr_dlq']
loan_data[ratio_cols] = loan_data[ratio_cols].fillna(loan_data[ratio_cols].median())
# Timeline features: median
timeline_cols = ['mo_sin_old_il_acct', 'mo_sin_old_rev_tl_op', 'mo_sin_rcnt_rev_tl_op', 'mo_sin_rcnt_tl']
loan_data[timeline_cols] = loan_data[timeline_cols].fillna(loan_data[timeline_cols].median())
# Balances and limits: median
balance_cols = [
'bc_open_to_buy', 'avg_cur_bal', 'total_rev_hi_lim', 'tot_cur_bal',
'total_il_high_credit_limit', 'tot_hi_cred_lim', 'total_bc_limit', 'total_bal_ex_mort'
]
loan_data[balance_cols] = loan_data[balance_cols].fillna(loan_data[balance_cols].median())
# Count features: fill with 0
count_cols = [
'num_rev_accts', 'num_accts_ever_120_pd', 'num_actv_bc_tl', 'num_actv_rev_tl',
'num_rev_tl_bal_gt_0', 'num_tl_90g_dpd_24m', 'num_tl_30dpd', 'num_tl_op_past_12m',
'num_op_rev_tl', 'num_il_tl', 'num_bc_tl', 'num_bc_sats', 'num_sats',
'mort_acc', 'acc_open_past_24mths', 'tot_coll_amt', 'num_tl_120dpd_2m'
]
loan_data[count_cols] = loan_data[count_cols].fillna(0)
# Title (text field)
loan_data['title'] = loan_data['title'].fillna("Unknown")
# mths_since_recent_bc
loan_data['mths_since_recent_bc'] = loan_data['mths_since_recent_bc'].fillna(loan_data['mths_since_recent_bc'].median())
Variable 'earliest_cr_line'¶
loan_data['earliest_cr_line_date'] = pd.to_datetime(loan_data['earliest_cr_line'], format='mixed', errors='coerce')
# Extracts the date and the time from a string variable that is in a given format.
type(loan_data['earliest_cr_line_date'][0])
# Checks the datatype of a single element of a column.
pandas._libs.tslibs.timestamps.Timestamp
# Assume we are now in December 2020
loan_data['mths_since_earliest_cr_line'] = round(pd.to_numeric((pd.to_datetime('2020-12-01') - loan_data['earliest_cr_line_date'])/np.timedelta64(30, 'D')))
# We calculate the difference between two dates in months, turn it to numeric datatype and round it.
# We save the result in a new variable.
loan_data['mths_since_earliest_cr_line'].describe()
# Shows some descriptive statisics for the values of a column.
# Dates from 1969 and before are not being converted well, i.e., they have become 2069 and similar,
# and negative differences are being calculated.
count 2.260701e+06 mean 2.553759e+02 std 9.554257e+01 min 6.200000e+01 25% 1.900000e+02 50% 2.390000e+02 75% 3.050000e+02 max 1.068000e+03 Name: mths_since_earliest_cr_line, dtype: float64
loan_data['earliest_cr_line_date'].sample(5)
405485 2000-01-01 1230975 1995-12-01 586343 2006-08-01 290012 2006-02-01 865494 1998-02-01 Name: earliest_cr_line_date, dtype: datetime64[ns]
For the following features, the missing rates are all very low (under ~0.1% for most), so we can confidently impute them using simple and fast methods without significant risk of bias or information loss. Here's a breakdown with recommended strategies:¶
*Date-related fields:* last_pymnt_d, last_credit_pull_d, earliest_cr_line_date, earliest_cr_line; use Mode or Most Recent Date (Dates are usually month/year — very low missing rate, fill with most frequent or recent).
*Ratios / Percentages:* revol_util, dti; use Median (Continuous; median is robust to outliers).
*Credit history indicators:* pub_rec_bankruptcies, collections_12_mths_ex_med, chargeoff_within_12_mths, tax_liens, pub_rec, delinq_2yrs, delinq_amnt, acc_now_delinq; fill with 0 (Missing likely means "none" (very common assumption in credit data).
*Credit activity counts:* open_acc, total_acc, inq_last_6mths, mths_since_earliest_cr_line; use Median or 0 (Continuous, stable values)
*Income:* annual_inc; use Median (Rarely missing, median better than mean due to skew).
*Zip code:* zip_code; use Mode (Categorical; most common zip is fine)
# 1. Date columns - fill with most frequent date or a placeholder (e.g., 'Jan-2019')
date_cols = ['last_pymnt_d', 'last_credit_pull_d', 'earliest_cr_line']
for col in date_cols:
most_common_date = loan_data[col].mode()[0]
loan_data[col] = loan_data[col].fillna(most_common_date)
# 2. Ratios: Fill with median
loan_data['revol_util'] = loan_data['revol_util'].fillna(loan_data['revol_util'].median())
loan_data['dti'] = loan_data['dti'].fillna(loan_data['dti'].median())
# 3. Credit public records: Assume 0 means no record
zero_fill_cols = [
'pub_rec_bankruptcies', 'chargeoff_within_12_mths', 'collections_12_mths_ex_med',
'tax_liens', 'pub_rec', 'delinq_2yrs', 'delinq_amnt', 'acc_now_delinq'
]
loan_data[zero_fill_cols] = loan_data[zero_fill_cols].fillna(0)
# 4. Credit counts: fill with median
count_cols = [
'open_acc', 'total_acc', 'inq_last_6mths', 'mths_since_earliest_cr_line'
]
loan_data[count_cols] = loan_data[count_cols].fillna(loan_data[count_cols].median())
# 5. Annual income
loan_data['annual_inc'] = loan_data['annual_inc'].fillna(loan_data['annual_inc'].median())
# 6. Zip code
loan_data['zip_code'] = loan_data['zip_code'].fillna(loan_data['zip_code'].mode()[0])
Variables 'last_pymnt_d' & 'last_credit_pull_d'¶
# 'last_credit_pull_d' and 'last_pymnt_d' are date features that give temporal insight into borrower behavior and loan servicing
# Creating time-based features like “months since last payment” or “months since last credit pull” can really help the credit risk model
# understand borrower behavior better.
# Convert date columns to datetime if they aren't already
loan_data['last_pymnt_d'] = pd.to_datetime(loan_data['last_pymnt_d'], format='mixed', errors='coerce')
loan_data['last_credit_pull_d'] = pd.to_datetime(loan_data['last_credit_pull_d'], format='mixed', errors='coerce')
# Reference date — can use today or a fixed date (e.g., end of data collection period)
reference_date = pd.to_datetime("2020-12-31") # Replace with appropriate date based on your dataset
# Create new features: months since last payment and last credit pull
loan_data['months_since_last_pymnt'] = round(pd.to_numeric((reference_date - loan_data['last_pymnt_d']) / np.timedelta64(30, 'D')))
loan_data['months_since_last_credit_pull'] = round(pd.to_numeric((reference_date - loan_data['last_credit_pull_d']) / np.timedelta64(30, 'D')))
For the following features, the missing rate is extremely low (~0.0015%), so imputation is safe and won't significantly affect our model. Here’s a breakdown and recommended strategies:¶
Categorical Flags / IDs: *'disbursement_method', 'hardship_flag', 'pymnt_plan', 'application_type', 'verification_status', 'home_ownership', 'initial_list_status', 'term', 'purpose', 'addr_state', 'sub_grade', 'grade', 'policy_code'; Use Mode* method; These are categorical. Use the most frequent (mode) value.
Interest & Loan Terms: *'int_rate', 'installment', 'term_int', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv'; Use Median*; Numeric and continuous. Median is robust to outliers.
Dates: *'issue_d', 'issue_d_date', 'mths_since_issue_d'; Use Most frequent date or calculated from other columns*; Ensure consistency, or recalculate if redundant.
FICO Scores: *'fico_range_low', 'fico_range_high', 'last_fico_range_low', 'last_fico_range_high'; Use Median*; Continuous, impute with median.
Payment-related: *'last_pymnt_amnt', 'collection_recovery_fee', 'recoveries', 'total_rec_late_fee', 'total_rec_int', 'total_rec_prncp', 'total_pymnt_inv', 'total_pymnt', 'out_prncp_inv', 'out_prncp', 'revol_bal'; Use Median or 0*; If monetary, median; if possibly "no payment", then 0.
Unnecessary / Deprecated: *'url'; Use Drop method*; Not useful for modeling — unique URL per loan.
Debt Settlement: *'debt_settlement_flag'; Use Mode*; Often 'N' or 'None'; treat as categorical.
# Categorical columns to fill with mode
cat_cols = [
'disbursement_method', 'hardship_flag', 'pymnt_plan', 'application_type', 'verification_status',
'home_ownership', 'initial_list_status', 'term', 'purpose', 'addr_state',
'sub_grade', 'grade', 'policy_code', 'debt_settlement_flag'
]
for col in cat_cols:
loan_data[col] = loan_data[col].fillna(loan_data[col].mode()[0])
# Numeric columns to fill with median
num_cols = [
'int_rate', 'installment', 'term_int', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv',
'fico_range_low', 'fico_range_high', 'last_fico_range_low', 'last_fico_range_high',
'last_pymnt_amnt', 'collection_recovery_fee', 'recoveries', 'total_rec_late_fee',
'total_rec_int', 'total_rec_prncp', 'total_pymnt_inv', 'total_pymnt', 'out_prncp_inv',
'out_prncp', 'revol_bal'
]
loan_data[num_cols] = loan_data[num_cols].fillna(loan_data[num_cols].median())
# Date columns - fill with mode
date_cols = ['issue_d', 'issue_d_date', 'mths_since_issue_d']
for col in date_cols:
loan_data[col] = loan_data[col].fillna(loan_data[col].mode()[0])
# Drop 'url' if not used
loan_data = loan_data.drop(columns=['url'])
loan_status (0.0015% missing)¶
Meaning: Current status of the loan (e.g., 'Fully Paid', 'Charged Off', 'Current', etc.).
Type: Categorical — often the target variable in credit risk models!
Strategy:
- ✅ If you are modeling loan status as the target, you should drop those rows (since we don’t want to guess the target).
- 🚫 Avoid filling with mode unless you're doing something like survival analysis or lifetime modeling.
Verify if there are remaining missing values¶
# Checking the percentage of the missing values for each category
missing_percent = loan_data.isnull().mean() * 100
missing_percent = missing_percent[missing_percent > 0].sort_values(ascending=False)
# Display the result
print(missing_percent)
loan_status 0.00146 dtype: float64
Now there are no missing values.
# Display a sample of 5 entries.
loan_data.sample(5)
| id | loan_amnt | funded_amnt | funded_amnt_inv | term | int_rate | installment | grade | sub_grade | home_ownership | annual_inc | verification_status | issue_d | loan_status | pymnt_plan | purpose | title | zip_code | addr_state | dti | delinq_2yrs | earliest_cr_line | fico_range_low | fico_range_high | inq_last_6mths | mths_since_last_delinq | mths_since_last_record | open_acc | pub_rec | revol_bal | revol_util | total_acc | initial_list_status | out_prncp | out_prncp_inv | total_pymnt | total_pymnt_inv | total_rec_prncp | total_rec_int | total_rec_late_fee | recoveries | collection_recovery_fee | last_pymnt_d | last_pymnt_amnt | next_pymnt_d | last_credit_pull_d | last_fico_range_high | last_fico_range_low | collections_12_mths_ex_med | mths_since_last_major_derog | policy_code | application_type | acc_now_delinq | tot_coll_amt | tot_cur_bal | open_acc_6m | open_act_il | open_il_12m | open_il_24m | mths_since_rcnt_il | total_bal_il | il_util | open_rv_12m | open_rv_24m | max_bal_bc | all_util | total_rev_hi_lim | inq_fi | total_cu_tl | inq_last_12m | acc_open_past_24mths | avg_cur_bal | bc_open_to_buy | bc_util | chargeoff_within_12_mths | delinq_amnt | mo_sin_old_il_acct | mo_sin_old_rev_tl_op | mo_sin_rcnt_rev_tl_op | mo_sin_rcnt_tl | mort_acc | mths_since_recent_bc | mths_since_recent_bc_dlq | mths_since_recent_inq | mths_since_recent_revol_delinq | num_accts_ever_120_pd | num_actv_bc_tl | num_actv_rev_tl | num_bc_sats | num_bc_tl | num_il_tl | num_op_rev_tl | num_rev_accts | num_rev_tl_bal_gt_0 | num_sats | num_tl_120dpd_2m | num_tl_30dpd | num_tl_90g_dpd_24m | num_tl_op_past_12m | pct_tl_nvr_dlq | percent_bc_gt_75 | pub_rec_bankruptcies | tax_liens | tot_hi_cred_lim | total_bal_ex_mort | total_bc_limit | total_il_high_credit_limit | hardship_flag | disbursement_method | debt_settlement_flag | emp_length_int | term_int | issue_d_date | mths_since_issue_d | target | earliest_cr_line_date | mths_since_earliest_cr_line | months_since_last_pymnt | months_since_last_credit_pull | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 963330 | 106165131 | 18000.0 | 18000.0 | 18000.0 | 60 months | 22.74 | 504.75 | E | E1 | RENT | 108000.0 | Verified | Apr-2017 | Current | n | debt_consolidation | Debt consolidation | 936xx | CA | 15.31 | 1.0 | Aug-1994 | 660.0 | 664.0 | 1.0 | 2.0 | 999.0 | 18.0 | 0.0 | 12074.0 | 37.4 | 37.0 | w | 13337.99 | 13337.99 | 11586.510000 | 11586.51 | 4662.01 | 6924.50 | 0.0 | 0.0 | 0.0 | 2019-03-01 | 504.75 | Apr-2019 | 2019-03-01 | 599.0 | 595.0 | 0.0 | 2.0 | 1.0 | Individual | 1.0 | 0.0 | 37135.0 | 1.0 | 2.0 | 0.0 | 1.0 | 21.0 | 25061.0 | 75.0 | 2.0 | 4.0 | 1250.0 | 57.0 | 32300.0 | 0.0 | 0.0 | 3.0 | 5.0 | 2184.0 | 4042.0 | 39.7 | 0.0 | 0.0 | 179.0 | 272.0 | 5.0 | 5.0 | 1.0 | 41.0 | 999.0 | 3.0 | 999.0 | 5.0 | 3.0 | 11.0 | 4.0 | 8.0 | 12.0 | 16.0 | 24.0 | 11.0 | 18.0 | 1.0 | 0.0 | 1.0 | 2.0 | 83.8 | 25.0 | 0.0 | 0.0 | 65700.0 | 37135.0 | 6700.0 | 33400.0 | N | Cash | N | 10.0 | 60.0 | 2017-04-01 | 45.0 | 0 | 1994-08-01 | 321.0 | 22.0 | 22.0 |
| 260707 | 51006181 | 20000.0 | 20000.0 | 20000.0 | 60 months | 15.61 | 482.23 | D | D1 | MORTGAGE | 92500.0 | Verified | Jun-2015 | Current | n | debt_consolidation | Debt consolidation | 920xx | CA | 13.40 | 2.0 | Jul-1999 | 665.0 | 669.0 | 2.0 | 15.0 | 999.0 | 9.0 | 0.0 | 24710.0 | 90.5 | 25.0 | f | 6532.82 | 6532.82 | 21683.010000 | 21683.01 | 13467.18 | 8215.83 | 0.0 | 0.0 | 0.0 | 2019-03-01 | 482.23 | Apr-2019 | 2019-03-01 | 699.0 | 695.0 | 0.0 | 15.0 | 1.0 | Individual | 0.0 | 0.0 | 279997.0 | 0.0 | 0.0 | 0.0 | 0.0 | 999.0 | 0.0 | 72.0 | 0.0 | 0.0 | 0.0 | 58.0 | 27300.0 | 0.0 | 0.0 | 0.0 | 4.0 | 31111.0 | 646.0 | 97.1 | 0.0 | 0.0 | 183.0 | 190.0 | 4.0 | 4.0 | 1.0 | 4.0 | 27.0 | 4.0 | 27.0 | 3.0 | 5.0 | 7.0 | 5.0 | 13.0 | 8.0 | 7.0 | 16.0 | 7.0 | 9.0 | 0.0 | 0.0 | 2.0 | 4.0 | 79.2 | 100.0 | 0.0 | 0.0 | 333252.0 | 25337.0 | 22300.0 | 852.0 | N | Cash | N | 10.0 | 60.0 | 2015-06-01 | 67.0 | 0 | 1999-07-01 | 261.0 | 22.0 | 22.0 |
| 1895814 | 1825127 | 5600.0 | 5600.0 | 5575.0 | 36 months | 12.12 | 186.33 | B | B3 | RENT | 48000.0 | Verified | Nov-2012 | Fully Paid | n | credit_card | Card Refinance | 900xx | CA | 22.09 | 0.0 | Mar-2004 | 680.0 | 684.0 | 0.0 | 999.0 | 999.0 | 9.0 | 0.0 | 16158.0 | 62.1 | 21.0 | f | 0.00 | 0.00 | 6683.299998 | 6653.46 | 5600.00 | 1083.30 | 0.0 | 0.0 | 0.0 | 2015-06-01 | 1097.89 | No Payment Due | 2017-07-01 | 509.0 | 505.0 | 0.0 | 999.0 | 1.0 | Individual | 0.0 | 0.0 | 22808.0 | 0.0 | 0.0 | 0.0 | 0.0 | 999.0 | 0.0 | 72.0 | 0.0 | 0.0 | 0.0 | 58.0 | 26000.0 | 0.0 | 0.0 | 0.0 | 1.0 | 2534.0 | 512.0 | 96.5 | 0.0 | 0.0 | 105.0 | 104.0 | 44.0 | 18.0 | 0.0 | 44.0 | 999.0 | 999.0 | 999.0 | 0.0 | 3.0 | 6.0 | 3.0 | 5.0 | 11.0 | 6.0 | 9.0 | 6.0 | 9.0 | 0.0 | 0.0 | 0.0 | 0.0 | 100.0 | 100.0 | 0.0 | 0.0 | 36930.0 | 22808.0 | 14600.0 | 6592.0 | N | Cash | N | 10.0 | 36.0 | 2012-11-01 | 98.0 | 0 | 2004-03-01 | 204.0 | 68.0 | 43.0 |
| 2103815 | 121429872 | 14000.0 | 14000.0 | 14000.0 | 60 months | 16.02 | 340.61 | C | C5 | MORTGAGE | 51600.0 | Verified | Nov-2017 | Current | n | debt_consolidation | Debt consolidation | 945xx | CA | 32.84 | 0.0 | Sep-1981 | 685.0 | 689.0 | 1.0 | 999.0 | 999.0 | 16.0 | 0.0 | 31112.0 | 53.7 | 33.0 | w | 11278.37 | 11278.37 | 5437.300000 | 5437.30 | 2721.63 | 2715.67 | 0.0 | 0.0 | 0.0 | 2019-03-01 | 340.61 | Apr-2019 | 2019-03-01 | 654.0 | 650.0 | 0.0 | 999.0 | 1.0 | Joint App | 0.0 | 668.0 | 434780.0 | 2.0 | 2.0 | 0.0 | 3.0 | 17.0 | 24105.0 | 83.0 | 1.0 | 3.0 | 5831.0 | 63.0 | 57900.0 | 2.0 | 0.0 | 2.0 | 10.0 | 27174.0 | 9703.0 | 69.6 | 0.0 | 0.0 | 149.0 | 433.0 | 6.0 | 3.0 | 6.0 | 24.0 | 999.0 | 5.0 | 999.0 | 0.0 | 7.0 | 11.0 | 8.0 | 11.0 | 7.0 | 13.0 | 20.0 | 11.0 | 16.0 | 0.0 | 0.0 | 0.0 | 4.0 | 100.0 | 62.5 | 0.0 | 0.0 | 467088.0 | 55217.0 | 31900.0 | 29188.0 | N | Cash | N | 10.0 | 60.0 | 2017-11-01 | 38.0 | 0 | 1981-09-01 | 478.0 | 22.0 | 22.0 |
| 71343 | 63357608 | 7500.0 | 7500.0 | 7500.0 | 36 months | 11.49 | 247.29 | B | B5 | RENT | 37000.0 | Verified | Nov-2015 | Charged Off | n | credit_card | Credit card refinancing | 852xx | AZ | 18.07 | 0.0 | Aug-2002 | 675.0 | 679.0 | 1.0 | 66.0 | 999.0 | 14.0 | 0.0 | 5577.0 | 90.0 | 26.0 | w | 0.00 | 0.00 | 6182.620000 | 6182.62 | 4928.56 | 1254.06 | 0.0 | 0.0 | 0.0 | 2017-12-01 | 247.29 | No Payment Due | 2018-09-01 | 519.0 | 515.0 | 0.0 | 68.0 | 1.0 | Individual | 0.0 | 0.0 | 52923.0 | 0.0 | 0.0 | 0.0 | 0.0 | 999.0 | 0.0 | 72.0 | 0.0 | 0.0 | 0.0 | 58.0 | 6200.0 | 0.0 | 0.0 | 0.0 | 5.0 | 4071.0 | 454.0 | 91.7 | 0.0 | 0.0 | 159.0 | 107.0 | 3.0 | 3.0 | 0.0 | 3.0 | 999.0 | 3.0 | 999.0 | 10.0 | 2.0 | 3.0 | 3.0 | 3.0 | 21.0 | 4.0 | 5.0 | 3.0 | 14.0 | 0.0 | 0.0 | 0.0 | 2.0 | 57.7 | 100.0 | 0.0 | 0.0 | 55298.0 | 52923.0 | 5500.0 | 49098.0 | N | Cash | N | 5.0 | 36.0 | 2015-11-01 | 62.0 | 1 | 2002-08-01 | 223.0 | 38.0 | 28.0 |
# Drop the column 'target'
loan_data = loan_data.drop(columns=['target'])
loan_data.shape
(2260701, 118)
The final version of the dataset is composed of 2260668 rows and 118 columns.¶
II. PD model (Probability of Default)¶
Data preparation¶
Dependent Variable — key target variable for credit risk modeling¶
The 'loan_status' feature is a key target variable for credit risk modeling.
loan_data['loan_status'].unique()
# Displays unique values of 'loan_status' column.
array(['Fully Paid', 'Current', 'Charged Off', 'In Grace Period',
'Late (31-120 days)', 'Late (16-30 days)', 'Default', nan,
'Does not meet the credit policy. Status:Fully Paid',
'Does not meet the credit policy. Status:Charged Off'],
dtype=object)
- Fully Paid: The borrower repaid the loan in full. ✅ Good outcome.
- Current: The borrower is still making payments and is on schedule. Ongoing loan.
- Charged Off: The lender has written off the loan as a loss after severe delinquency. ❌ Bad outcome.
- Late (31-120 days): Payments are overdue by 31 to 120 days. High-risk. May end in default or charge-off.
- In Grace Period: Recently missed payment, but within an acceptable grace period (usually <15 days). Moderate risk.
- Late (16-30 days): Slightly overdue, may still recover. Warning sign.
- Does not meet the credit policy. Status:Fully Paid: Was funded outside of normal policy but fully paid. You can treat like “Fully Paid”.
- Does not meet the credit policy. Status:Charged Off: Outside policy and ended in loss. You can treat like “Charged Off”.
- Default: Officially declared as defaulted. Worst-case scenario. May overlap with Charged Off.
loan_data['loan_status'].value_counts()
# Calculates the number of observations for each unique value of a variable.
loan_status Fully Paid 1076751 Current 878317 Charged Off 268559 Late (31-120 days) 21467 In Grace Period 8436 Late (16-30 days) 4349 Does not meet the credit policy. Status:Fully Paid 1988 Does not meet the credit policy. Status:Charged Off 761 Default 40 Name: count, dtype: int64
loan_data['loan_status'].value_counts() / loan_data['loan_status'].count() *100
# We divide the number of observations for each unique value of a variable by the total number of observations.
# Thus, we get the proportion of observations for each unique value of a variable.
loan_status Fully Paid 47.629771 Current 38.852100 Charged Off 11.879630 Late (31-120 days) 0.949587 In Grace Period 0.373164 Late (16-30 days) 0.192377 Does not meet the credit policy. Status:Fully Paid 0.087939 Does not meet the credit policy. Status:Charged Off 0.033663 Default 0.001769 Name: count, dtype: float64
Typical Modeling Strategy: Grouping Loan Status¶
To build a binary credit risk model (e.g., Will the borrower default or not?), we will group into "Good" vs. "Bad" loans:
✅ Good:
- Fully Paid
- Does not meet the credit policy. Status:Fully Paid
❌Bad:
- Charged Off
- Default
- Late (31-120 days)
- Late (16-30 days)
- Does not meet the credit policy. Status:Charged Off
⚠️ Special Cases:
- Current: The borrower is up to date on payments. However, the loan has not reached maturity — so we don’t yet know if it will default or be fully paid.
- In Grace Period: The borrower has missed a payment, but is still within the lender’s allowed grace period (typically 15 days). This could still go either way — recovery or default.
✅ Best Practice for Credit Risk Modeling: Treat “Current” and “In Grace Period” as unknown cases and exclude them from model training. Use them later only for prediction/evaluation if needed.
✅ Treating Them as Unknown = Conservative, Trustworthy, and Realistic:
- These loans haven’t finished their life cycle. Some will default later, others will be fully paid — you just don’t know yet.
- Including them during training will create label noise and weaken your model's ability to differentiate true risk signals.
📈 Clean Binary Classification = Better Interpretability:
- You can clearly define:
- Good (0): Fully Paid
- Bad (1): Charged Off, Default, and potentially Late
- Train a robust binary classifier, then apply it to Current loans as future predictions.
For the two statuses:¶
- 'Does not meet the credit policy. Status:Fully Paid'
- 'Does not meet the credit policy. Status:Charged Off'
These are special cases flagged by Lending Club: They indicate that the loan didn’t meet Lending Club’s internal credit policy at the time of application, but was still funded (usually manually by investors or for internal testing).
However, they do have known final outcomes:
- Some ended up Fully Paid.
- Others ended up Charged Off.
These cases may not be representative of the standard population:
- Bypassed the normal screening process.
- Could be riskier or manually approved based on different criteria.
- Might bias the model slightly if not handled carefully.
Recommended Options:
- Exclude them for maximum model purity.
- The model will reflect then only standard Lending Club loan approval logic.
# Step 1: Keep only loans with known outcomes
loan_data_clean = loan_data[loan_data['loan_status'].isin(['Fully Paid','Charged Off','Default',
'Late (31-120 days)','Late (16-30 days)'])]
# copy the datafile
loan_data_clean = loan_data_clean.copy()
# Good/ Bad Definition
loan_data_clean['good_bad'] = np.where(
loan_data_clean['loan_status'].isin(['Charged Off', 'Default',
'Late (31-120 days)', 'Late (16-30 days)']), 1, 0)
# We create a new variable that has the value of '0' if a condition is met, and the value of '1' if it is not met.
# shape of the final cleaned dataset
loan_data_clean.shape
(1371166, 119)
The cleaned dataset that will be used for the training of the model is composed of 1371166 rows.
Splitting Data¶
from sklearn.model_selection import train_test_split
# Imports the libraries we need.
loan_data_inputs_train, loan_data_inputs_test, loan_data_targets_train, loan_data_targets_test = train_test_split(
loan_data_clean.drop('good_bad', axis = 1), loan_data_clean['good_bad'], test_size = 0.2, random_state = 42)
# We split two dataframes with inputs and targets, each into a train and test dataframe, and store them in variables.
# This time we set the size of the test dataset to be 20%.
# Respectively, the size of the train dataset becomes 80%.
# We also set a specific random state.
# This would allow us to perform the exact same split multimple times.
# This means, to assign the exact same observations to the train and test datasets.
loan_data_inputs_train.shape
# Displays the size of the dataframe.
(1096932, 118)
loan_data_targets_train.shape
# Displays the size of the dataframe.
(1096932,)
loan_data_inputs_test.shape
# Displays the size of the dataframe.
(274234, 118)
loan_data_targets_test.shape
# Displays the size of the dataframe.
(274234,)
Save inputs data & targets data for training & testing¶
#####
#df_inputs_prepr = loan_data_inputs_train
#df_targets_prepr = loan_data_targets_train
#####
df_inputs_prepr = loan_data_inputs_test
df_targets_prepr = loan_data_targets_test
A. Preprocessing Discrete Variables¶
The Weight of Evidence (WoE) is a powerful technique, especially in credit scoring and risk modeling, for transforming categorical or discrete variables into a numerical format that’s both predictive and interpretable.
What is Weight of Evidence (WoE)?¶
Weight of Evidence transforms categorical or binned continuous variables into a numeric scale that measures how strongly a variable predicts the target (usually binary: good vs. bad loan).
- It's widely used in credit scoring because:
- It helps handle categorical variables with many levels.
- It ensures monotonic relationship with the target variable.
- It works well with logistic regression models.
How to Use WoE with Discrete (Categorical) Variables?¶
- Group the variable’s categories (or bin if continuous).
- Count the number of goods and bads in each group.
- Calculate WoE for each group using the formula above.
- Replace each category in the original variable with its WoE value.
Why Use WoE?¶
- Ensures variables have a predictive relationship with the target.
- Useful for interpretable models like scorecards or logistic regression.
- Helps identify information value (IV) — a metric to judge a variable’s predictive power.
Bonus: Use WoE Together with Information Value (IV)¶
- IV < 0.02 → Not predictive.
- 0.02–0.1 → Weak.
- 0.1–0.3 → Medium.
- 0.3–0.5 → Strong
- higher than 0.5 → Suspiciously powerful (check for data leakage)
# WoE function for discrete unordered variables
def woe_discrete(df, discrete_variabe_name, good_bad_variable_df):
# Concatenates two dataframes along the columns.
df = pd.concat([df[discrete_variabe_name], good_bad_variable_df], axis = 1)
# Groups the data according to a criterion contained in one column.
# Does not turn the names of the values of the criterion as indexes.
# Aggregates the data in another column, using a selected function.
# In this specific case, we group by the column with index 0 and we aggregate the values of the column with index 1.
# More specifically, we count them.
# In other words, we count the values in the column with index 1 for each value of the column with index 0.
# We calculate then the mean of the values in the column with index 1 for each value of the column with index 0.
# And concatenate two dataframes along the columns.
df = pd.concat([df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].count(),
df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].mean()], axis = 1)
# Selects only columns with specific indexes.
df = df.iloc[:, [0, 1, 3]]
# Changes the names of the columns of a dataframe.
df.columns = [df.columns.values[0], 'n_obs', 'prop_good']
# We divide the values of one column by he values of another column and save the result in a new variable.
df['prop_n_obs'] = df['n_obs'] / df['n_obs'].sum()
# We multiply the values of one column by he values of another column and save the result in a new variable.
df['n_good'] = df['prop_good'] * df['n_obs']
df['n_bad'] = (1 - df['prop_good']) * df['n_obs']
# We calculate the proportion of good and the proportion of bad observations
df['prop_n_good'] = df['n_good'] / df['n_good'].sum()
df['prop_n_bad'] = df['n_bad'] / df['n_bad'].sum()
# We take the natural logarithm of a variable and save the result in a nex variable.
# WoE = Weight of Evidence
df['WoE'] = np.log(df['prop_n_good'] / df['prop_n_bad'])
# Sorts a dataframe by the values of a given column.
df = df.sort_values(['WoE'])
# We reset the index of a dataframe and overwrite it.
df = df.reset_index(drop = True)
# We take the difference between two subsequent values of a column. Then, we take the absolute value of the result.
df['diff_prop_good'] = df['prop_good'].diff().abs()
# We take the difference between two subsequent values of a column. Then, we take the absolute value of the result.
df['diff_WoE'] = df['WoE'].diff().abs()
# We sum all values of a given column.
df['IV'] = (df['prop_n_good'] - df['prop_n_bad']) * df['WoE']
df['IV'] = df['IV'].sum()
return df
# The function takes 3 arguments: a dataframe, a string, and a dataframe. The function returns a dataframe as a result.
Preprocessing Discrete Variables: Visualizing Results¶
# Imports the libraries we need.
sns.set()
# We set the default style of the graphs to the seaborn style.
# Below we define a function that takes 2 arguments: a dataframe and a number.
# The number parameter has a default value of 0.
# This means that if we call the function and omit the number parameter, it will be executed with it having a value of 0.
# The function displays a graph.
def plot_by_woe(df_WoE, rotation_of_x_axis_labels = 0):
x = np.array(df_WoE.iloc[:, 0].apply(str))
# Turns the values of the column with index 0 to strings, makes an array from these strings, and passes it to variable x.
y = df_WoE['WoE']
# Selects a column with label 'WoE' and passes it to variable y.
plt.figure(figsize=(18, 6))
# Sets the graph size to width 18 x height 6.
plt.plot(x, y, marker = 'o', linestyle = '--', color = 'k')
# Plots the datapoints with coordiantes variable x on the x-axis and variable y on the y-axis.
# Sets the marker for each datapoint to a circle, the style line between the points to dashed, and the color to black.
plt.xlabel(df_WoE.columns[0])
# Names the x-axis with the name of the column with index 0.
plt.ylabel('Weight of Evidence')
# Names the y-axis 'Weight of Evidence'.
plt.title(str('Weight of Evidence by ' + df_WoE.columns[0]))
# Names the grapth 'Weight of Evidence by ' the name of the column with index 0.
plt.xticks(rotation = rotation_of_x_axis_labels)
# Rotates the labels of the x-axis a predefined number of degrees.
List of the categorical variables in the dataset¶
# Check the list of the categorical features of the dataset
categorical_vars = loan_data_clean.select_dtypes(include=['object', 'category']).columns.tolist()
print(categorical_vars)
print()
print('Number of categorical variables : ',len(categorical_vars))
['id', 'term', 'grade', 'sub_grade', 'home_ownership', 'verification_status', 'issue_d', 'loan_status', 'pymnt_plan', 'purpose', 'title', 'zip_code', 'addr_state', 'earliest_cr_line', 'initial_list_status', 'next_pymnt_d', 'application_type', 'hardship_flag', 'disbursement_method', 'debt_settlement_flag'] Number of categorical variables : 20
Certain categorical variables should be dropped from the training of the credit risk model:¶
- id : Unique identifier – drop it entirely
- emp_title: Very high cardinality – can be binned or dropped unless NLP is used
- title: Free text similar to purpose – often redundant/noisy
- zip_code: Contains geographic info, but only first 3 digits – may not generalize
- earliest_cr_line, 'earliest_cr_line_date': Date-type, not categorical — age of credit history is instead extracted from them.
- next_pymnt_d: Future payment date — may not be useful for initial risk prediction
- issue_d, 'issue_d_date': Loan issue date — convert to numeric loan age
- pymnt_plan: Usually 'n' (no) for nearly all loans – low variance, maybe drop
- 'term': Converted to another numerical variable 'term_int'
- 'last_pymnt_d': Date — convert to numeric age
- 'last_credit_pull_d': Date — convert to numeric age
- loan_status: Target variable — use only to define target, not as a feature
list_col_to_drop = ['id', 'title', 'zip_code', 'earliest_cr_line', 'earliest_cr_line_date', 'next_pymnt_d', 'issue_d',
'issue_d_date', 'pymnt_plan', 'term', 'last_pymnt_d', 'last_credit_pull_d', 'loan_status']
df_inputs_prepr = df_inputs_prepr.drop(columns = list_col_to_drop)
df_inputs_prepr.shape
(274234, 105)
The final vesrion of the training and test datasets are composed by 105 features.
# Check the list of the categorical features of the dataset
categorical_vars_final = df_inputs_prepr.select_dtypes(include=['object', 'category']).columns.tolist()
print(categorical_vars_final)
['grade', 'sub_grade', 'home_ownership', 'verification_status', 'purpose', 'addr_state', 'initial_list_status', 'application_type', 'hardship_flag', 'disbursement_method', 'debt_settlement_flag']
There area 11 categorical features.
# Unique values of 'grade' feature
df_inputs_prepr['grade'].unique()
array(['C', 'D', 'A', 'E', 'F', 'B', 'G'], dtype=object)
# Unique values of 'sub_grade' feature
df_inputs_prepr['sub_grade'].unique()
array(['C4', 'D2', 'C1', 'A5', 'E3', 'C3', 'C5', 'F4', 'E2', 'B2', 'A1',
'C2', 'D3', 'B1', 'B4', 'E5', 'B5', 'D1', 'B3', 'A4', 'A3', 'E1',
'D5', 'F5', 'F2', 'G4', 'A2', 'D4', 'F1', 'G1', 'G5', 'G2', 'G3',
'F3', 'E4'], dtype=object)
Variable 'grade'¶
Avoid using both 'grade' and 'sub_grade' together:
- Multicollinearity risk: grade is derived from sub_grade, so including both introduces strong correlation.
- It adds noise rather than new signal.
We choose to keep the variable 'grade' and to drop the variable 'sub_grade'.
# Drop the 'sub_grade' column.
df_inputs_prepr = df_inputs_prepr.drop(columns = ['sub_grade'])
# 'grade'
df_temp = woe_discrete(df_inputs_prepr, 'grade', df_targets_prepr)
# We execute the function we defined with the necessary arguments: a dataframe, a string, and a dataframe.
# We store the result in a dataframe.
df_temp
| grade | n_obs | prop_good | prop_n_obs | n_good | n_bad | prop_n_good | prop_n_bad | WoE | diff_prop_good | diff_WoE | IV | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A | 47247 | 0.068428 | 0.172287 | 3233.0 | 44014.0 | 0.054981 | 0.204306 | -1.312628 | NaN | NaN | 0.437964 |
| 1 | B | 79943 | 0.146042 | 0.291514 | 11675.0 | 68268.0 | 0.198548 | 0.316889 | -0.467522 | 0.077614 | 0.845106 | 0.437964 |
| 2 | C | 77916 | 0.241825 | 0.284122 | 18842.0 | 59074.0 | 0.320431 | 0.274212 | 0.155767 | 0.095783 | 0.623289 | 0.437964 |
| 3 | D | 41485 | 0.321827 | 0.151276 | 13351.0 | 28134.0 | 0.227050 | 0.130593 | 0.553082 | 0.080003 | 0.397315 | 0.437964 |
| 4 | E | 19227 | 0.402403 | 0.070112 | 7737.0 | 11490.0 | 0.131577 | 0.053335 | 0.903006 | 0.080576 | 0.349924 | 0.437964 |
| 5 | F | 6599 | 0.463252 | 0.024063 | 3057.0 | 3542.0 | 0.051988 | 0.016441 | 1.151212 | 0.060849 | 0.248206 | 0.437964 |
| 6 | G | 1817 | 0.499174 | 0.006626 | 907.0 | 910.0 | 0.015425 | 0.004224 | 1.295167 | 0.035922 | 0.143955 | 0.437964 |
plot_by_woe(df_temp)
# We execute the function we defined with the necessary arguments: a dataframe.
# We omit the number argument, which means the function will use its default value, 0.
df_var_dummies =[pd.get_dummies(df_inputs_prepr['grade'], prefix = 'grade', prefix_sep = ':')]
# We create dummy variables from original independent variables, and save them into a list.
# Note that we are using a particular naming convention for all variables: original variable name, colon, category name.
df_var_dummies = pd.concat(df_var_dummies, axis = 1)
# We concatenate the dummy variables and this turns them into a dataframe.
df_inputs_prepr = pd.concat([df_inputs_prepr, df_var_dummies], axis = 1)
# Concatenates two dataframes.
# Here we concatenate the dataframe with original data with the dataframe with dummy variables, along the columns.
df_inputs_prepr.head()
| loan_amnt | funded_amnt | funded_amnt_inv | int_rate | installment | grade | home_ownership | annual_inc | verification_status | purpose | addr_state | dti | delinq_2yrs | fico_range_low | fico_range_high | inq_last_6mths | mths_since_last_delinq | mths_since_last_record | open_acc | pub_rec | revol_bal | revol_util | total_acc | initial_list_status | out_prncp | out_prncp_inv | total_pymnt | total_pymnt_inv | total_rec_prncp | total_rec_int | total_rec_late_fee | recoveries | collection_recovery_fee | last_pymnt_amnt | last_fico_range_high | last_fico_range_low | collections_12_mths_ex_med | mths_since_last_major_derog | policy_code | application_type | acc_now_delinq | tot_coll_amt | tot_cur_bal | open_acc_6m | open_act_il | open_il_12m | open_il_24m | mths_since_rcnt_il | total_bal_il | il_util | open_rv_12m | open_rv_24m | max_bal_bc | all_util | total_rev_hi_lim | inq_fi | total_cu_tl | inq_last_12m | acc_open_past_24mths | avg_cur_bal | bc_open_to_buy | bc_util | chargeoff_within_12_mths | delinq_amnt | mo_sin_old_il_acct | mo_sin_old_rev_tl_op | mo_sin_rcnt_rev_tl_op | mo_sin_rcnt_tl | mort_acc | mths_since_recent_bc | mths_since_recent_bc_dlq | mths_since_recent_inq | mths_since_recent_revol_delinq | num_accts_ever_120_pd | num_actv_bc_tl | num_actv_rev_tl | num_bc_sats | num_bc_tl | num_il_tl | num_op_rev_tl | num_rev_accts | num_rev_tl_bal_gt_0 | num_sats | num_tl_120dpd_2m | num_tl_30dpd | num_tl_90g_dpd_24m | num_tl_op_past_12m | pct_tl_nvr_dlq | percent_bc_gt_75 | pub_rec_bankruptcies | tax_liens | tot_hi_cred_lim | total_bal_ex_mort | total_bc_limit | total_il_high_credit_limit | hardship_flag | disbursement_method | debt_settlement_flag | emp_length_int | term_int | mths_since_issue_d | mths_since_earliest_cr_line | months_since_last_pymnt | months_since_last_credit_pull | grade:A | grade:B | grade:C | grade:D | grade:E | grade:F | grade:G | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 299291 | 12000.0 | 12000.0 | 12000.0 | 13.99 | 279.16 | C | OWN | 30000.0 | Source Verified | debt_consolidation | SD | 25.32 | 0.0 | 675.0 | 679.0 | 1.0 | 76.0 | 999.0 | 19.0 | 0.0 | 11405.0 | 60.3 | 35.0 | w | 0.0 | 0.0 | 13667.840000 | 13667.84 | 12000.00 | 1667.84 | 0.0 | 0.00 | 0.0000 | 10615.73 | 574.0 | 570.0 | 0.0 | 999.0 | 1.0 | Individual | 0.0 | 0.0 | 88510.0 | 0.0 | 0.0 | 0.0 | 0.0 | 999.0 | 0.0 | 72.0 | 0.0 | 0.0 | 0.0 | 58.0 | 18900.0 | 0.0 | 0.0 | 0.0 | 4.0 | 4658.0 | 1221.0 | 83.3 | 0.0 | 0.0 | 127.0 | 121.0 | 15.0 | 5.0 | 0.0 | 36.0 | 76.0 | 5.0 | 76.0 | 0.0 | 6.0 | 11.0 | 6.0 | 14.0 | 7.0 | 14.0 | 28.0 | 11.0 | 19.0 | 0.0 | 0.0 | 0.0 | 1.0 | 97.1 | 66.7 | 0.0 | 0.0 | 97351.0 | 88510.0 | 7300.0 | 78451.0 | N | Cash | N | 7.0 | 60.0 | 68.0 | 198.0 | 57.0 | 22.0 | False | False | True | False | False | False | False |
| 2099335 | 35000.0 | 35000.0 | 35000.0 | 18.06 | 889.92 | D | RENT | 140000.0 | Source Verified | debt_consolidation | CA | 20.49 | 0.0 | 695.0 | 699.0 | 2.0 | 999.0 | 999.0 | 12.0 | 0.0 | 30808.0 | 20.0 | 18.0 | w | 0.0 | 0.0 | 11582.320000 | 11582.32 | 3063.09 | 3986.04 | 0.0 | 4533.19 | 815.9742 | 889.92 | 574.0 | 570.0 | 0.0 | 999.0 | 1.0 | Individual | 0.0 | 0.0 | 91853.0 | 3.0 | 4.0 | 4.0 | 4.0 | 4.0 | 61045.0 | 87.0 | 2.0 | 5.0 | 11140.0 | 20.0 | 157000.0 | 0.0 | 2.0 | 4.0 | 9.0 | 7654.0 | 19625.0 | 20.0 | 0.0 | 0.0 | 93.0 | 72.0 | 3.0 | 3.0 | 0.0 | 3.0 | 999.0 | 3.0 | 999.0 | 0.0 | 8.0 | 8.0 | 8.0 | 8.0 | 9.0 | 8.0 | 9.0 | 8.0 | 12.0 | 0.0 | 0.0 | 0.0 | 6.0 | 100.0 | 0.0 | 0.0 | 0.0 | 227204.0 | 91853.0 | 157000.0 | 70204.0 | N | Cash | N | 2.0 | 60.0 | 38.0 | 132.0 | 30.0 | 24.0 | False | False | False | True | False | False | False |
| 113647 | 8400.0 | 8400.0 | 8400.0 | 12.29 | 280.17 | C | MORTGAGE | 70495.0 | Verified | other | GA | 16.04 | 0.0 | 660.0 | 664.0 | 1.0 | 44.0 | 999.0 | 19.0 | 0.0 | 16940.0 | 94.1 | 39.0 | w | 0.0 | 0.0 | 10008.154816 | 10008.15 | 8400.00 | 1608.15 | 0.0 | 0.00 | 0.0000 | 196.31 | 709.0 | 705.0 | 0.0 | 45.0 | 1.0 | Individual | 0.0 | 79.0 | 145252.0 | 0.0 | 0.0 | 0.0 | 0.0 | 999.0 | 0.0 | 72.0 | 0.0 | 0.0 | 0.0 | 58.0 | 18000.0 | 0.0 | 0.0 | 0.0 | 8.0 | 7645.0 | 207.0 | 98.0 | 0.0 | 0.0 | 98.0 | 101.0 | 3.0 | 3.0 | 0.0 | 16.0 | 44.0 | 3.0 | 44.0 | 2.0 | 2.0 | 5.0 | 2.0 | 3.0 | 28.0 | 5.0 | 10.0 | 5.0 | 19.0 | 0.0 | 0.0 | 0.0 | 2.0 | 92.1 | 100.0 | 0.0 | 0.0 | 142744.0 | 145252.0 | 10500.0 | 124664.0 | N | Cash | N | 2.0 | 36.0 | 63.0 | 165.0 | 28.0 | 28.0 | False | False | True | False | False | False | False |
| 180785 | 18000.0 | 18000.0 | 18000.0 | 7.89 | 563.15 | A | RENT | 68000.0 | Not Verified | debt_consolidation | NV | 11.12 | 0.0 | 705.0 | 709.0 | 0.0 | 81.0 | 999.0 | 6.0 | 0.0 | 7540.0 | 55.4 | 21.0 | w | 0.0 | 0.0 | 20235.797565 | 20235.80 | 18000.00 | 2235.80 | 0.0 | 0.00 | 0.0000 | 2230.78 | 714.0 | 710.0 | 0.0 | 999.0 | 1.0 | Individual | 0.0 | 0.0 | 33713.0 | 0.0 | 0.0 | 0.0 | 0.0 | 999.0 | 0.0 | 72.0 | 0.0 | 0.0 | 0.0 | 58.0 | 13600.0 | 0.0 | 0.0 | 0.0 | 2.0 | 5619.0 | 4353.0 | 63.1 | 0.0 | 0.0 | 94.0 | 96.0 | 9.0 | 9.0 | 0.0 | 9.0 | 81.0 | 21.0 | 81.0 | 0.0 | 3.0 | 4.0 | 3.0 | 5.0 | 14.0 | 4.0 | 7.0 | 4.0 | 6.0 | 0.0 | 0.0 | 0.0 | 1.0 | 94.7 | 33.3 | 0.0 | 0.0 | 46695.0 | 33713.0 | 11800.0 | 33095.0 | N | Cash | N | 0.0 | 36.0 | 65.0 | 162.0 | 32.0 | 27.0 | True | False | False | False | False | False | False |
| 1805875 | 12000.0 | 12000.0 | 12000.0 | 21.60 | 455.81 | E | RENT | 165000.0 | Not Verified | credit_card | NY | 7.53 | 1.0 | 680.0 | 684.0 | 3.0 | 10.0 | 999.0 | 12.0 | 0.0 | 7883.0 | 47.2 | 25.0 | f | 0.0 | 0.0 | 13956.129946 | 13956.13 | 12000.00 | 1956.13 | 0.0 | 0.00 | 0.0000 | 9854.71 | 749.0 | 745.0 | 0.0 | 10.0 | 1.0 | Individual | 0.0 | 0.0 | 26652.0 | 0.0 | 0.0 | 0.0 | 0.0 | 999.0 | 0.0 | 72.0 | 0.0 | 0.0 | 0.0 | 58.0 | 16700.0 | 0.0 | 0.0 | 0.0 | 10.0 | 2221.0 | 327.0 | 85.1 | 0.0 | 0.0 | 22.0 | 162.0 | 2.0 | 2.0 | 0.0 | 2.0 | 34.0 | 0.0 | 34.0 | 1.0 | 2.0 | 7.0 | 2.0 | 11.0 | 2.0 | 9.0 | 22.0 | 7.0 | 12.0 | 0.0 | 0.0 | 1.0 | 5.0 | 86.0 | 100.0 | 0.0 | 0.0 | 42613.0 | 26652.0 | 2200.0 | 16500.0 | N | Cash | N | 2.0 | 36.0 | 88.0 | 251.0 | 79.0 | 22.0 | False | False | False | False | True | False | False |
Variable 'home_ownership'¶
# 'home_ownership'
df_temp = woe_discrete(df_inputs_prepr, 'home_ownership', df_targets_prepr)
# We calculate weight of evidence.
df_temp
| home_ownership | n_obs | prop_good | prop_n_obs | n_good | n_bad | prop_n_good | prop_n_bad | WoE | diff_prop_good | diff_WoE | IV | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NONE | 12 | 0.166667 | 0.000044 | 2.0 | 10.0 | 0.000034 | 0.000046 | -0.310968 | NaN | NaN | 0.031033 |
| 1 | MORTGAGE | 135489 | 0.185329 | 0.494063 | 25110.0 | 110379.0 | 0.427026 | 0.512361 | -0.182184 | 0.018662 | 0.128784 | 0.031033 |
| 2 | OWN | 29698 | 0.223012 | 0.108294 | 6623.0 | 23075.0 | 0.112632 | 0.107110 | 0.050268 | 0.037683 | 0.232452 | 0.031033 |
| 3 | RENT | 108950 | 0.248233 | 0.397288 | 27045.0 | 81905.0 | 0.459933 | 0.380190 | 0.190412 | 0.025221 | 0.140143 | 0.031033 |
| 4 | OTHER | 36 | 0.250000 | 0.000131 | 9.0 | 27.0 | 0.000153 | 0.000125 | 0.199857 | 0.001767 | 0.009446 | 0.031033 |
| 5 | ANY | 49 | 0.265306 | 0.000179 | 13.0 | 36.0 | 0.000221 | 0.000167 | 0.279900 | 0.015306 | 0.080043 | 0.031033 |
plot_by_woe(df_temp)
# We plot the weight of evidence values.
df_var_dummies = [pd.get_dummies(df_inputs_prepr['home_ownership'], prefix = 'home_ownership', prefix_sep = ':')]
# We create dummy variables from original independent variables, and save them into a list.
# Note that we are using a particular naming convention for all variables: original variable name, colon, category name.
df_var_dummies = pd.concat(df_var_dummies, axis = 1)
# We concatenate the dummy variables and this turns them into a dataframe.
# There are many categories with very few observations and many categories with very different "good" %.
# Therefore, we create a new discrete variable where we combine some of the categories.
# 'OTHERS' and 'NONE' are riskiest but are very few. 'RENT' is the next riskiest.
# 'ANY' are least risky but are too few. Conceptually, they belong to the same category. Also, their inclusion would not change anything.
# We combine them in one category, 'RENT_OTHER_NONE_ANY'.
# We end up with 3 categories: 'RENT_OTHER_NONE_ANY', 'OWN', 'MORTGAGE'.
df_var_dummies['home_ownership:RENT_OTHER_NONE_ANY'] = sum([df_var_dummies['home_ownership:RENT'], df_var_dummies['home_ownership:OTHER'],
df_var_dummies['home_ownership:NONE'], df_var_dummies['home_ownership:ANY']])
# 'RENT_OTHER_NONE_ANY' will be the reference category.
df_var_dummies = df_var_dummies.drop(columns = ['home_ownership:RENT', 'home_ownership:OTHER', 'home_ownership:NONE', 'home_ownership:ANY'])
# Drop the dummy variables that are grouped together
df_inputs_prepr = pd.concat([df_inputs_prepr, df_var_dummies], axis = 1)
# Concatenates two dataframes.
# Here we concatenate the dataframe with original data with the dataframe with dummy variables, along the columns.
Variable 'addr_state'¶
# 'addr_state'
df_inputs_prepr['addr_state'].unique()
array(['SD', 'CA', 'GA', 'NV', 'NY', 'PA', 'MD', 'MT', 'FL', 'CO', 'NJ',
'CT', 'TX', 'WV', 'MA', 'NM', 'NC', 'IL', 'AZ', 'IN', 'MO', 'HI',
'NE', 'NH', 'WA', 'KY', 'SC', 'TN', 'MN', 'LA', 'RI', 'VA', 'UT',
'AL', 'ND', 'OH', 'MI', 'ID', 'KS', 'DE', 'OR', 'WY', 'AK', 'WI',
'ME', 'AR', 'MS', 'OK', 'VT', 'DC', 'IA'], dtype=object)
df_temp = woe_discrete(df_inputs_prepr, 'addr_state', df_targets_prepr)
# We calculate weight of evidence.
df_temp
C:\Users\pc\anaconda3\envs\envname\Lib\site-packages\pandas\core\arraylike.py:399: RuntimeWarning: divide by zero encountered in log result = getattr(ufunc, method)(*inputs, **kwargs)
| addr_state | n_obs | prop_good | prop_n_obs | n_good | n_bad | prop_n_good | prop_n_bad | WoE | diff_prop_good | diff_WoE | IV | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | IA | 1 | 0.000000 | 0.000004 | 0.0 | 1.0 | 0.000000 | 0.000005 | -inf | NaN | NaN | inf |
| 1 | ME | 395 | 0.121519 | 0.001440 | 48.0 | 347.0 | 0.000816 | 0.001611 | -0.679654 | 0.121519 | inf | inf |
| 2 | NH | 1293 | 0.146945 | 0.004715 | 190.0 | 1103.0 | 0.003231 | 0.005120 | -0.460296 | 0.025426 | 0.219359 | inf |
| 3 | DC | 696 | 0.152299 | 0.002538 | 106.0 | 590.0 | 0.001803 | 0.002739 | -0.418214 | 0.005354 | 0.042082 | inf |
| 4 | OR | 3271 | 0.153470 | 0.011928 | 502.0 | 2769.0 | 0.008537 | 0.012853 | -0.409172 | 0.001171 | 0.009042 | inf |
| 5 | VT | 517 | 0.160542 | 0.001885 | 83.0 | 434.0 | 0.001412 | 0.002015 | -0.355734 | 0.007072 | 0.053437 | inf |
| 6 | KS | 2261 | 0.163644 | 0.008245 | 370.0 | 1891.0 | 0.006292 | 0.008778 | -0.332889 | 0.003103 | 0.022846 | inf |
| 7 | CO | 6119 | 0.165550 | 0.022313 | 1013.0 | 5106.0 | 0.017227 | 0.023701 | -0.319031 | 0.001906 | 0.013858 | inf |
| 8 | WY | 601 | 0.166389 | 0.002192 | 100.0 | 501.0 | 0.001701 | 0.002326 | -0.312966 | 0.000839 | 0.006064 | inf |
| 9 | WV | 1015 | 0.166502 | 0.003701 | 169.0 | 846.0 | 0.002874 | 0.003927 | -0.312151 | 0.000113 | 0.000815 | inf |
| 10 | RI | 1226 | 0.177814 | 0.004471 | 218.0 | 1008.0 | 0.003707 | 0.004679 | -0.232759 | 0.011312 | 0.079392 | inf |
| 11 | WA | 5989 | 0.182167 | 0.021839 | 1091.0 | 4898.0 | 0.018554 | 0.022736 | -0.203263 | 0.004353 | 0.029496 | inf |
| 12 | ND | 318 | 0.182390 | 0.001160 | 58.0 | 260.0 | 0.000986 | 0.001207 | -0.201769 | 0.000223 | 0.001494 | inf |
| 13 | SC | 3262 | 0.182403 | 0.011895 | 595.0 | 2667.0 | 0.010119 | 0.012380 | -0.201679 | 0.000013 | 0.000091 | inf |
| 14 | UT | 2064 | 0.185078 | 0.007526 | 382.0 | 1682.0 | 0.006496 | 0.007808 | -0.183849 | 0.002674 | 0.017830 | inf |
| 15 | CT | 4031 | 0.188787 | 0.014699 | 761.0 | 3270.0 | 0.012942 | 0.015179 | -0.159442 | 0.003709 | 0.024406 | inf |
| 16 | IL | 10477 | 0.189367 | 0.038205 | 1984.0 | 8493.0 | 0.033740 | 0.039423 | -0.155658 | 0.000580 | 0.003785 | inf |
| 17 | WI | 3544 | 0.189898 | 0.012923 | 673.0 | 2871.0 | 0.011445 | 0.013327 | -0.152201 | 0.000531 | 0.003457 | inf |
| 18 | MA | 6381 | 0.200282 | 0.023268 | 1278.0 | 5103.0 | 0.021734 | 0.023687 | -0.086063 | 0.010384 | 0.066138 | inf |
| 19 | HI | 1360 | 0.201471 | 0.004959 | 274.0 | 1086.0 | 0.004660 | 0.005041 | -0.078659 | 0.001189 | 0.007404 | inf |
| 20 | GA | 8846 | 0.203595 | 0.032257 | 1801.0 | 7045.0 | 0.030628 | 0.032702 | -0.065507 | 0.002124 | 0.013152 | inf |
| 21 | DE | 765 | 0.205229 | 0.002790 | 157.0 | 608.0 | 0.002670 | 0.002822 | -0.055460 | 0.001634 | 0.010047 | inf |
| 22 | MN | 4862 | 0.205265 | 0.017729 | 998.0 | 3864.0 | 0.016972 | 0.017936 | -0.055235 | 0.000037 | 0.000224 | inf |
| 23 | MT | 761 | 0.210250 | 0.002775 | 160.0 | 601.0 | 0.002721 | 0.002790 | -0.024952 | 0.004984 | 0.030284 | inf |
| 24 | SD | 585 | 0.211966 | 0.002133 | 124.0 | 461.0 | 0.002109 | 0.002140 | -0.014647 | 0.001716 | 0.010305 | inf |
| 25 | CA | 40010 | 0.213472 | 0.145897 | 8541.0 | 31469.0 | 0.145250 | 0.146074 | -0.005655 | 0.001506 | 0.008992 | inf |
| 26 | NC | 7750 | 0.215097 | 0.028261 | 1667.0 | 6083.0 | 0.028349 | 0.028236 | 0.003997 | 0.001625 | 0.009652 | inf |
| 27 | TX | 22218 | 0.215276 | 0.081018 | 4783.0 | 17435.0 | 0.081341 | 0.080930 | 0.005058 | 0.000179 | 0.001061 | inf |
| 28 | AZ | 6583 | 0.215707 | 0.024005 | 1420.0 | 5163.0 | 0.024149 | 0.023966 | 0.007609 | 0.000431 | 0.002551 | inf |
| 29 | AK | 649 | 0.215716 | 0.002367 | 140.0 | 509.0 | 0.002381 | 0.002363 | 0.007664 | 0.000009 | 0.000055 | inf |
| 30 | KY | 2642 | 0.216124 | 0.009634 | 571.0 | 2071.0 | 0.009711 | 0.009613 | 0.010072 | 0.000408 | 0.002408 | inf |
| 31 | IN | 4508 | 0.218057 | 0.016439 | 983.0 | 3525.0 | 0.016717 | 0.016362 | 0.021443 | 0.001933 | 0.011371 | inf |
| 32 | MO | 4352 | 0.218061 | 0.015870 | 949.0 | 3403.0 | 0.016139 | 0.015796 | 0.021466 | 0.000004 | 0.000023 | inf |
| 33 | MI | 7295 | 0.218780 | 0.026601 | 1596.0 | 5699.0 | 0.027142 | 0.026454 | 0.025679 | 0.000719 | 0.004214 | inf |
| 34 | OH | 8954 | 0.220013 | 0.032651 | 1970.0 | 6984.0 | 0.033502 | 0.032419 | 0.032881 | 0.001233 | 0.007202 | inf |
| 35 | PA | 9237 | 0.220526 | 0.033683 | 2037.0 | 7200.0 | 0.034642 | 0.033421 | 0.035867 | 0.000513 | 0.002985 | inf |
| 36 | VA | 7728 | 0.223732 | 0.028180 | 1729.0 | 5999.0 | 0.029404 | 0.027846 | 0.054420 | 0.003206 | 0.018553 | inf |
| 37 | TN | 4096 | 0.225342 | 0.014936 | 923.0 | 3173.0 | 0.015697 | 0.014729 | 0.063666 | 0.001610 | 0.009246 | inf |
| 38 | FL | 19831 | 0.227220 | 0.072314 | 4506.0 | 15325.0 | 0.076630 | 0.071136 | 0.074394 | 0.001878 | 0.010728 | inf |
| 39 | NJ | 9932 | 0.227547 | 0.036217 | 2260.0 | 7672.0 | 0.038434 | 0.035612 | 0.076257 | 0.000327 | 0.001863 | inf |
| 40 | NM | 1464 | 0.228142 | 0.005339 | 334.0 | 1130.0 | 0.005680 | 0.005245 | 0.079638 | 0.000595 | 0.003381 | inf |
| 41 | NV | 4091 | 0.230017 | 0.014918 | 941.0 | 3150.0 | 0.016003 | 0.014622 | 0.090255 | 0.001875 | 0.010617 | inf |
| 42 | MD | 6357 | 0.236118 | 0.023181 | 1501.0 | 4856.0 | 0.025526 | 0.022541 | 0.124386 | 0.006101 | 0.034131 | inf |
| 43 | NY | 22427 | 0.238329 | 0.081781 | 5345.0 | 17082.0 | 0.090898 | 0.079292 | 0.136606 | 0.002211 | 0.012220 | inf |
| 44 | AL | 3323 | 0.242552 | 0.012117 | 806.0 | 2517.0 | 0.013707 | 0.011684 | 0.159730 | 0.004223 | 0.023124 | inf |
| 45 | LA | 3183 | 0.251021 | 0.011607 | 799.0 | 2384.0 | 0.013588 | 0.011066 | 0.205295 | 0.008469 | 0.045565 | inf |
| 46 | ID | 307 | 0.254072 | 0.001119 | 78.0 | 229.0 | 0.001326 | 0.001063 | 0.221456 | 0.003051 | 0.016161 | inf |
| 47 | OK | 2590 | 0.264479 | 0.009444 | 685.0 | 1905.0 | 0.011649 | 0.008843 | 0.275651 | 0.010407 | 0.054195 | inf |
| 48 | AR | 2022 | 0.266568 | 0.007373 | 539.0 | 1483.0 | 0.009166 | 0.006884 | 0.286363 | 0.002089 | 0.010712 | inf |
| 49 | NE | 741 | 0.271255 | 0.002702 | 201.0 | 540.0 | 0.003418 | 0.002507 | 0.310205 | 0.004687 | 0.023843 | inf |
| 50 | MS | 1304 | 0.278374 | 0.004755 | 363.0 | 941.0 | 0.006173 | 0.004368 | 0.345929 | 0.007119 | 0.035724 | inf |
plot_by_woe(df_temp)
# We plot the weight of evidence values.
plot_by_woe(df_temp.iloc[2: -2, : ])
# We plot the weight of evidence values.
plot_by_woe(df_temp.iloc[6: -6, : ])
# We plot the weight of evidence values.
df_var_dummies = [pd.get_dummies(df_inputs_prepr['addr_state'], prefix = 'addr_state', prefix_sep = ':')]
# We create dummy variables from original independent variables, and save them into a list.
# Note that we are using a particular naming convention for all variables: original variable name, colon, category name.
df_var_dummies = pd.concat(df_var_dummies, axis = 1)
# We concatenate the dummy variables and this turns them into a dataframe.
# We create the following categories:
# 'ND' 'NE' 'IA' NV' 'FL' 'HI' 'AL'
# 'NM' 'VA'
# 'NY'
# 'OK' 'TN' 'MO' 'LA' 'MD' 'NC'
# 'CA'
# 'UT' 'KY' 'AZ' 'NJ'
# 'AR' 'MI' 'PA' 'OH' 'MN'
# 'RI' 'MA' 'DE' 'SD' 'IN'
# 'GA' 'WA' 'OR'
# 'WI' 'MT'
# 'TX'
# 'IL' 'CT'
# 'KS' 'SC' 'CO' 'VT' 'AK' 'MS'
# 'WV' 'NH' 'WY' 'DC' 'ME' 'ID'
# 'IA_NV_HI_ID_AL_FL' will be the reference category.
df_inputs_prepr['addr_state:ND_NE_IA_NV_FL_HI_AL'] = sum([df_var_dummies['addr_state:ND'], df_var_dummies['addr_state:NE'],
df_var_dummies['addr_state:IA'], df_var_dummies['addr_state:NV'],
df_var_dummies['addr_state:FL'], df_var_dummies['addr_state:HI'],
df_var_dummies['addr_state:AL']])
df_inputs_prepr['addr_state:NM_VA'] = sum([df_var_dummies['addr_state:NM'], df_var_dummies['addr_state:VA']])
df_inputs_prepr['addr_state:OK_TN_MO_LA_MD_NC'] = sum([df_var_dummies['addr_state:OK'], df_var_dummies['addr_state:TN'],
df_var_dummies['addr_state:MO'], df_var_dummies['addr_state:LA'],
df_var_dummies['addr_state:MD'], df_var_dummies['addr_state:NC']])
df_inputs_prepr['addr_state:UT_KY_AZ_NJ'] = sum([df_var_dummies['addr_state:UT'], df_var_dummies['addr_state:KY'],
df_var_dummies['addr_state:AZ'], df_var_dummies['addr_state:NJ']])
df_inputs_prepr['addr_state:AR_MI_PA_OH_MN'] = sum([df_var_dummies['addr_state:AR'], df_var_dummies['addr_state:MI'],
df_var_dummies['addr_state:PA'], df_var_dummies['addr_state:OH'],
df_var_dummies['addr_state:MN']])
df_inputs_prepr['addr_state:RI_MA_DE_SD_IN'] = sum([df_var_dummies['addr_state:RI'], df_var_dummies['addr_state:MA'],
df_var_dummies['addr_state:DE'], df_var_dummies['addr_state:SD'],
df_var_dummies['addr_state:IN']])
df_inputs_prepr['addr_state:GA_WA_OR'] = sum([df_var_dummies['addr_state:GA'], df_var_dummies['addr_state:WA'],
df_var_dummies['addr_state:OR']])
df_inputs_prepr['addr_state:WI_MT'] = sum([df_var_dummies['addr_state:WI'], df_var_dummies['addr_state:MT']])
df_inputs_prepr['addr_state:IL_CT'] = sum([df_var_dummies['addr_state:IL'], df_var_dummies['addr_state:CT']])
df_inputs_prepr['addr_state:KS_SC_CO_VT_AK_MS'] = sum([df_var_dummies['addr_state:KS'], df_var_dummies['addr_state:SC'],
df_var_dummies['addr_state:CO'], df_var_dummies['addr_state:VT'],
df_var_dummies['addr_state:AK'], df_var_dummies['addr_state:MS']])
df_inputs_prepr['addr_state:WV_NH_WY_DC_ME_ID'] = sum([df_var_dummies['addr_state:WV'], df_var_dummies['addr_state:NH'],
df_var_dummies['addr_state:WY'], df_var_dummies['addr_state:DC'],
df_var_dummies['addr_state:ME'], df_var_dummies['addr_state:ID']])
Variable 'verification_status'¶
# 'verification_status'
df_temp = woe_discrete(df_inputs_prepr, 'verification_status', df_targets_prepr)
# We calculate weight of evidence.
df_temp
| verification_status | n_obs | prop_good | prop_n_obs | n_good | n_bad | prop_n_good | prop_n_bad | WoE | diff_prop_good | diff_WoE | IV | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Not Verified | 82536 | 0.160887 | 0.300969 | 13279.0 | 69257.0 | 0.225826 | 0.321480 | -0.353171 | NaN | NaN | 0.050456 |
| 1 | Source Verified | 106486 | 0.225344 | 0.388303 | 23996.0 | 82490.0 | 0.408081 | 0.382905 | 0.063680 | 0.064457 | 0.416850 | 0.050456 |
| 2 | Verified | 85212 | 0.252629 | 0.310727 | 21527.0 | 63685.0 | 0.366093 | 0.295615 | 0.213828 | 0.027285 | 0.150149 | 0.050456 |
plot_by_woe(df_temp)
# We plot the weight of evidence values.
df_var_dummies = [pd.get_dummies(df_inputs_prepr['verification_status'], prefix = 'verification_status', prefix_sep = ':')]
# We create dummy variables from original independent variables, and save them into a list.
# Note that we are using a particular naming convention for all variables: original variable name, colon, category name.
df_var_dummies = pd.concat(df_var_dummies, axis = 1)
# We concatenate the dummy variables and this turns them into a dataframe.
df_inputs_prepr = pd.concat([df_inputs_prepr, df_var_dummies], axis = 1)
# Concatenates two dataframes.
# Here we concatenate the dataframe with original data with the dataframe with dummy variables, along the columns.
Variable 'purpose'¶
# 'purpose'
df_temp = woe_discrete(df_inputs_prepr, 'purpose', df_targets_prepr)
# We calculate weight of evidence.
df_temp
| purpose | n_obs | prop_good | prop_n_obs | n_good | n_bad | prop_n_good | prop_n_bad | WoE | diff_prop_good | diff_WoE | IV | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | wedding | 456 | 0.133772 | 0.001663 | 61.0 | 395.0 | 0.001037 | 0.001834 | -0.569542 | NaN | NaN | 0.019054 |
| 1 | car | 2922 | 0.148186 | 0.010655 | 433.0 | 2489.0 | 0.007364 | 0.011554 | -0.450429 | 0.014414 | 0.119113 | 0.019054 |
| 2 | educational | 66 | 0.181818 | 0.000241 | 12.0 | 54.0 | 0.000204 | 0.000251 | -0.205608 | 0.033632 | 0.244821 | 0.019054 |
| 3 | credit_card | 59492 | 0.182024 | 0.216939 | 10829.0 | 48663.0 | 0.184160 | 0.225886 | -0.204222 | 0.000206 | 0.001386 | 0.019054 |
| 4 | home_improvement | 17962 | 0.195246 | 0.065499 | 3507.0 | 14455.0 | 0.059641 | 0.067098 | -0.117810 | 0.013221 | 0.086412 | 0.019054 |
| 5 | major_purchase | 6053 | 0.198744 | 0.022072 | 1203.0 | 4850.0 | 0.020458 | 0.022513 | -0.095691 | 0.003499 | 0.022119 | 0.019054 |
| 6 | vacation | 1864 | 0.217275 | 0.006797 | 405.0 | 1459.0 | 0.006888 | 0.006772 | 0.016850 | 0.018530 | 0.112541 | 0.019054 |
| 7 | debt_consolidation | 159235 | 0.226244 | 0.580654 | 36026.0 | 123209.0 | 0.612666 | 0.571916 | 0.068828 | 0.008970 | 0.051978 | 0.019054 |
| 8 | other | 16206 | 0.228372 | 0.059096 | 3701.0 | 12505.0 | 0.062940 | 0.058046 | 0.080944 | 0.002128 | 0.012116 | 0.019054 |
| 9 | moving | 2014 | 0.237339 | 0.007344 | 478.0 | 1536.0 | 0.008129 | 0.007130 | 0.131143 | 0.008966 | 0.050199 | 0.019054 |
| 10 | medical | 3246 | 0.240604 | 0.011837 | 781.0 | 2465.0 | 0.013282 | 0.011442 | 0.149098 | 0.003265 | 0.017954 | 0.019054 |
| 11 | house | 1477 | 0.247800 | 0.005386 | 366.0 | 1111.0 | 0.006224 | 0.005157 | 0.188087 | 0.007196 | 0.038989 | 0.019054 |
| 12 | renewable_energy | 173 | 0.254335 | 0.000631 | 44.0 | 129.0 | 0.000748 | 0.000599 | 0.222847 | 0.006536 | 0.034760 | 0.019054 |
| 13 | small_business | 3068 | 0.311604 | 0.011188 | 956.0 | 2112.0 | 0.016258 | 0.009804 | 0.505837 | 0.057268 | 0.282990 | 0.019054 |
plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
df_var_dummies = [pd.get_dummies(df_inputs_prepr['purpose'], prefix = 'purpose', prefix_sep = ':')]
# We create dummy variables from original independent variables, and save them into a list.
# Note that we are using a particular naming convention for all variables: original variable name, colon, category name.
df_var_dummies = pd.concat(df_var_dummies, axis = 1)
# We concatenate the dummy variables and this turns them into a dataframe.
# We combine 'small_business', 'moving', 'renewable_energy', 'house' and 'medical' in one category: 'sm_b__mov__ren_en__house__medic'.
# We combine 'other', 'vacation' and 'major_purchase' in one category: other__vacat__maj_purch'.
# We combine 'home_improvement', ''wedding'educational', 'car' and in one category: 'home_impr__educ__car__wed'.
# We leave 'debt_consolidtion' in a separate category.
# We leave 'credit_card' in a separate category.
#'sm_b__mov__ren_en__house__medic' will be the reference category.
df_inputs_prepr['purpose:debt_consolidation'] = df_var_dummies['purpose:debt_consolidation']
df_inputs_prepr['purpose:credit_card'] = df_var_dummies['purpose:credit_card']
df_inputs_prepr['purpose:sm_b__mov__ren_en__house__medic'] = sum([df_var_dummies['purpose:small_business'], df_var_dummies['purpose:moving'],
df_var_dummies['purpose:renewable_energy'],df_var_dummies['purpose:house'],
df_var_dummies['purpose:medical']])
df_inputs_prepr['purpose:other__vacat__maj_purch'] = sum([df_var_dummies['purpose:other'], df_var_dummies['purpose:major_purchase'],
df_var_dummies['purpose:vacation']])
df_inputs_prepr['purpose:home_impr__educ__car__wed'] = sum([df_var_dummies['purpose:home_improvement'], df_var_dummies['purpose:car'],
df_var_dummies['purpose:home_improvement'], df_var_dummies['purpose:wedding']])
Variables: 'initial_list_status', 'application_type', 'hardship_flag', 'disbursement_method', 'debt_settlement_flag'¶
Each of these variables has only 2 unique categories.
loan_data_dummies = [pd.get_dummies(df_inputs_prepr['initial_list_status'], prefix = 'initial_list_status', prefix_sep = ':'),
pd.get_dummies(df_inputs_prepr['application_type'], prefix = 'application_type', prefix_sep = ':'),
pd.get_dummies(df_inputs_prepr['hardship_flag'], prefix = 'hardship_flag', prefix_sep = ':'),
pd.get_dummies(df_inputs_prepr['disbursement_method'], prefix = 'disbursement_method', prefix_sep = ':'),
pd.get_dummies(df_inputs_prepr['debt_settlement_flag'], prefix = 'debt_settlement_flag', prefix_sep = ':')]
# We create dummy variables from all these original independent variables, and save them into a list.
# Note that we are using a particular naming convention for all variables: original variable name, colon, category name.
loan_data_dummies = pd.concat(loan_data_dummies, axis = 1)
# We concatenate the dummy variables and this turns them into a dataframe.
df_inputs_prepr = pd.concat([df_inputs_prepr, loan_data_dummies], axis = 1)
# Concatenates two dataframes.
# Here we concatenate the dataframe with original data with the dataframe with dummy variables, along the columns.
convert all True and False values to 1 and 0
df_inputs_prepr = df_inputs_prepr.replace({True: 1, False: 0})
C:\Users\pc\AppData\Local\Temp\ipykernel_6728\2025257523.py:1: FutureWarning: Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`
df_inputs_prepr = df_inputs_prepr.replace({True: 1, False: 0})
df_inputs_prepr.head()
| loan_amnt | funded_amnt | funded_amnt_inv | int_rate | installment | grade | home_ownership | annual_inc | verification_status | purpose | addr_state | dti | delinq_2yrs | fico_range_low | fico_range_high | inq_last_6mths | mths_since_last_delinq | mths_since_last_record | open_acc | pub_rec | revol_bal | revol_util | total_acc | initial_list_status | out_prncp | out_prncp_inv | total_pymnt | total_pymnt_inv | total_rec_prncp | total_rec_int | total_rec_late_fee | recoveries | collection_recovery_fee | last_pymnt_amnt | last_fico_range_high | last_fico_range_low | collections_12_mths_ex_med | mths_since_last_major_derog | policy_code | application_type | acc_now_delinq | tot_coll_amt | tot_cur_bal | open_acc_6m | open_act_il | open_il_12m | open_il_24m | mths_since_rcnt_il | total_bal_il | il_util | open_rv_12m | open_rv_24m | max_bal_bc | all_util | total_rev_hi_lim | inq_fi | total_cu_tl | inq_last_12m | acc_open_past_24mths | avg_cur_bal | bc_open_to_buy | bc_util | chargeoff_within_12_mths | delinq_amnt | mo_sin_old_il_acct | mo_sin_old_rev_tl_op | mo_sin_rcnt_rev_tl_op | mo_sin_rcnt_tl | mort_acc | mths_since_recent_bc | mths_since_recent_bc_dlq | mths_since_recent_inq | mths_since_recent_revol_delinq | num_accts_ever_120_pd | num_actv_bc_tl | num_actv_rev_tl | num_bc_sats | num_bc_tl | num_il_tl | num_op_rev_tl | num_rev_accts | num_rev_tl_bal_gt_0 | num_sats | num_tl_120dpd_2m | num_tl_30dpd | num_tl_90g_dpd_24m | num_tl_op_past_12m | pct_tl_nvr_dlq | percent_bc_gt_75 | pub_rec_bankruptcies | tax_liens | tot_hi_cred_lim | total_bal_ex_mort | total_bc_limit | total_il_high_credit_limit | hardship_flag | disbursement_method | debt_settlement_flag | emp_length_int | term_int | mths_since_issue_d | mths_since_earliest_cr_line | months_since_last_pymnt | months_since_last_credit_pull | grade:A | grade:B | grade:C | grade:D | grade:E | grade:F | grade:G | home_ownership:MORTGAGE | home_ownership:OWN | home_ownership:RENT_OTHER_NONE_ANY | addr_state:ND_NE_IA_NV_FL_HI_AL | addr_state:NM_VA | addr_state:OK_TN_MO_LA_MD_NC | addr_state:UT_KY_AZ_NJ | addr_state:AR_MI_PA_OH_MN | addr_state:RI_MA_DE_SD_IN | addr_state:GA_WA_OR | addr_state:WI_MT | addr_state:IL_CT | addr_state:KS_SC_CO_VT_AK_MS | addr_state:WV_NH_WY_DC_ME_ID | verification_status:Not Verified | verification_status:Source Verified | verification_status:Verified | purpose:debt_consolidation | purpose:credit_card | purpose:sm_b__mov__ren_en__house__medic | purpose:other__vacat__maj_purch | purpose:home_impr__educ__car__wed | initial_list_status:f | initial_list_status:w | application_type:Individual | application_type:Joint App | hardship_flag:N | hardship_flag:Y | disbursement_method:Cash | disbursement_method:DirectPay | debt_settlement_flag:N | debt_settlement_flag:Y | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 299291 | 12000.0 | 12000.0 | 12000.0 | 13.99 | 279.16 | C | OWN | 30000.0 | Source Verified | debt_consolidation | SD | 25.32 | 0.0 | 675.0 | 679.0 | 1.0 | 76.0 | 999.0 | 19.0 | 0.0 | 11405.0 | 60.3 | 35.0 | w | 0.0 | 0.0 | 13667.840000 | 13667.84 | 12000.00 | 1667.84 | 0.0 | 0.00 | 0.0000 | 10615.73 | 574.0 | 570.0 | 0.0 | 999.0 | 1.0 | Individual | 0.0 | 0.0 | 88510.0 | 0.0 | 0.0 | 0.0 | 0.0 | 999.0 | 0.0 | 72.0 | 0.0 | 0.0 | 0.0 | 58.0 | 18900.0 | 0.0 | 0.0 | 0.0 | 4.0 | 4658.0 | 1221.0 | 83.3 | 0.0 | 0.0 | 127.0 | 121.0 | 15.0 | 5.0 | 0.0 | 36.0 | 76.0 | 5.0 | 76.0 | 0.0 | 6.0 | 11.0 | 6.0 | 14.0 | 7.0 | 14.0 | 28.0 | 11.0 | 19.0 | 0.0 | 0.0 | 0.0 | 1.0 | 97.1 | 66.7 | 0.0 | 0.0 | 97351.0 | 88510.0 | 7300.0 | 78451.0 | N | Cash | N | 7.0 | 60.0 | 68.0 | 198.0 | 57.0 | 22.0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
| 2099335 | 35000.0 | 35000.0 | 35000.0 | 18.06 | 889.92 | D | RENT | 140000.0 | Source Verified | debt_consolidation | CA | 20.49 | 0.0 | 695.0 | 699.0 | 2.0 | 999.0 | 999.0 | 12.0 | 0.0 | 30808.0 | 20.0 | 18.0 | w | 0.0 | 0.0 | 11582.320000 | 11582.32 | 3063.09 | 3986.04 | 0.0 | 4533.19 | 815.9742 | 889.92 | 574.0 | 570.0 | 0.0 | 999.0 | 1.0 | Individual | 0.0 | 0.0 | 91853.0 | 3.0 | 4.0 | 4.0 | 4.0 | 4.0 | 61045.0 | 87.0 | 2.0 | 5.0 | 11140.0 | 20.0 | 157000.0 | 0.0 | 2.0 | 4.0 | 9.0 | 7654.0 | 19625.0 | 20.0 | 0.0 | 0.0 | 93.0 | 72.0 | 3.0 | 3.0 | 0.0 | 3.0 | 999.0 | 3.0 | 999.0 | 0.0 | 8.0 | 8.0 | 8.0 | 8.0 | 9.0 | 8.0 | 9.0 | 8.0 | 12.0 | 0.0 | 0.0 | 0.0 | 6.0 | 100.0 | 0.0 | 0.0 | 0.0 | 227204.0 | 91853.0 | 157000.0 | 70204.0 | N | Cash | N | 2.0 | 60.0 | 38.0 | 132.0 | 30.0 | 24.0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
| 113647 | 8400.0 | 8400.0 | 8400.0 | 12.29 | 280.17 | C | MORTGAGE | 70495.0 | Verified | other | GA | 16.04 | 0.0 | 660.0 | 664.0 | 1.0 | 44.0 | 999.0 | 19.0 | 0.0 | 16940.0 | 94.1 | 39.0 | w | 0.0 | 0.0 | 10008.154816 | 10008.15 | 8400.00 | 1608.15 | 0.0 | 0.00 | 0.0000 | 196.31 | 709.0 | 705.0 | 0.0 | 45.0 | 1.0 | Individual | 0.0 | 79.0 | 145252.0 | 0.0 | 0.0 | 0.0 | 0.0 | 999.0 | 0.0 | 72.0 | 0.0 | 0.0 | 0.0 | 58.0 | 18000.0 | 0.0 | 0.0 | 0.0 | 8.0 | 7645.0 | 207.0 | 98.0 | 0.0 | 0.0 | 98.0 | 101.0 | 3.0 | 3.0 | 0.0 | 16.0 | 44.0 | 3.0 | 44.0 | 2.0 | 2.0 | 5.0 | 2.0 | 3.0 | 28.0 | 5.0 | 10.0 | 5.0 | 19.0 | 0.0 | 0.0 | 0.0 | 2.0 | 92.1 | 100.0 | 0.0 | 0.0 | 142744.0 | 145252.0 | 10500.0 | 124664.0 | N | Cash | N | 2.0 | 36.0 | 63.0 | 165.0 | 28.0 | 28.0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
| 180785 | 18000.0 | 18000.0 | 18000.0 | 7.89 | 563.15 | A | RENT | 68000.0 | Not Verified | debt_consolidation | NV | 11.12 | 0.0 | 705.0 | 709.0 | 0.0 | 81.0 | 999.0 | 6.0 | 0.0 | 7540.0 | 55.4 | 21.0 | w | 0.0 | 0.0 | 20235.797565 | 20235.80 | 18000.00 | 2235.80 | 0.0 | 0.00 | 0.0000 | 2230.78 | 714.0 | 710.0 | 0.0 | 999.0 | 1.0 | Individual | 0.0 | 0.0 | 33713.0 | 0.0 | 0.0 | 0.0 | 0.0 | 999.0 | 0.0 | 72.0 | 0.0 | 0.0 | 0.0 | 58.0 | 13600.0 | 0.0 | 0.0 | 0.0 | 2.0 | 5619.0 | 4353.0 | 63.1 | 0.0 | 0.0 | 94.0 | 96.0 | 9.0 | 9.0 | 0.0 | 9.0 | 81.0 | 21.0 | 81.0 | 0.0 | 3.0 | 4.0 | 3.0 | 5.0 | 14.0 | 4.0 | 7.0 | 4.0 | 6.0 | 0.0 | 0.0 | 0.0 | 1.0 | 94.7 | 33.3 | 0.0 | 0.0 | 46695.0 | 33713.0 | 11800.0 | 33095.0 | N | Cash | N | 0.0 | 36.0 | 65.0 | 162.0 | 32.0 | 27.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
| 1805875 | 12000.0 | 12000.0 | 12000.0 | 21.60 | 455.81 | E | RENT | 165000.0 | Not Verified | credit_card | NY | 7.53 | 1.0 | 680.0 | 684.0 | 3.0 | 10.0 | 999.0 | 12.0 | 0.0 | 7883.0 | 47.2 | 25.0 | f | 0.0 | 0.0 | 13956.129946 | 13956.13 | 12000.00 | 1956.13 | 0.0 | 0.00 | 0.0000 | 9854.71 | 749.0 | 745.0 | 0.0 | 10.0 | 1.0 | Individual | 0.0 | 0.0 | 26652.0 | 0.0 | 0.0 | 0.0 | 0.0 | 999.0 | 0.0 | 72.0 | 0.0 | 0.0 | 0.0 | 58.0 | 16700.0 | 0.0 | 0.0 | 0.0 | 10.0 | 2221.0 | 327.0 | 85.1 | 0.0 | 0.0 | 22.0 | 162.0 | 2.0 | 2.0 | 0.0 | 2.0 | 34.0 | 0.0 | 34.0 | 1.0 | 2.0 | 7.0 | 2.0 | 11.0 | 2.0 | 9.0 | 22.0 | 7.0 | 12.0 | 0.0 | 0.0 | 1.0 | 5.0 | 86.0 | 100.0 | 0.0 | 0.0 | 42613.0 | 26652.0 | 2200.0 | 16500.0 | N | Cash | N | 2.0 | 36.0 | 88.0 | 251.0 | 79.0 | 22.0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
B. Preprocessing Continuous Variables¶
Check the list of numerical features¶
- Feature with more than 10 unique values will be considered as continuous.
- Feautre with less than or equal to 10 unique values will be considered as discreate.
# Step 1: Get numerical columns
num_cols = loan_data_inputs_train.select_dtypes(include=['float64', 'int64']).columns
# Step 2: Filter continuous features (optional rule of thumb: more than 10 unique values)
continuous_features = [col for col in num_cols if loan_data_inputs_train[col].nunique() > 10]
print("Continuous features:")
print(continuous_features)
print()
print('Number of continuous features: ', len(continuous_features))
Continuous features: ['min_mths_since_delinquency'] Number of continuous features: 1
# Get numerical features wit less than or equal 10 unique values (that can be considered as discrete features)
Other_features = [col for col in num_cols if loan_data_inputs_train[col].nunique() <= 10]
print("Numerical discrete features:")
print(Other_features)
print()
print('Number of numerical discrete features: ', len(Other_features))
Numerical discrete features: ['grade:A', 'grade:B', 'grade:C', 'grade:D', 'grade:E', 'grade:F', 'grade:G', 'home_ownership:MORTGAGE', 'home_ownership:OWN', 'verification_status:Not Verified', 'verification_status:Source Verified', 'verification_status:Verified', 'purpose:debt_consolidation', 'purpose:credit_card', 'initial_list_status:f', 'initial_list_status:w', 'application_type:Individual', 'application_type:Joint App', 'hardship_flag:N', 'hardship_flag:Y', 'disbursement_method:Cash', 'disbursement_method:DirectPay', 'debt_settlement_flag:N', 'debt_settlement_flag:Y'] Number of numerical discrete features: 24
We found 88 continuous features and 6 discrete features.
Here's how I suggest organizing the features for credit risk modeling, splitting them into three categories:
Check for multicollinearity, the most common approach is to calculate the correlation matrix and/or compute the Variance Inflation Factor (VIF) for each feature.¶
A proposed classification of the numerical features into distinct feature families based on their meaning and purpose in credit risk modeling:
1. Loan and Funding Information
These features relate to the loan’s original terms and funding:
- loan_amnt
- funded_amnt
- funded_amnt_inv
- term_int
- int_rate
- installment
2. Applicant Financial Profile
Measures of income, employment duration.
- annual_inc
- emp_length_int
- dti
3. Credit History and Delinquency
How the borrower has paid (or missed) obligations.
- delinq_2yrs
- mths_since_last_delinq
- mths_since_last_record
- collections_12_mths_ex_med
- mths_since_last_major_derog
- chargeoff_within_12_mths
- delinq_amnt
- acc_now_delinq
- num_accts_ever_120_pd
- num_tl_90g_dpd_24m
- num_tl_120dpd_2m
- num_tl_30dpd
4. Credit Utilization & Balance
Measures how much of available credit is being used:
- revol_bal
- revol_util
- il_util
- all_util
- bc_util
- bc_open_to_buy
- total_bal_ex_mort
- total_bal_il
- max_bal_bc
- avg_cur_bal
- total_bc_limit
5. Credit Limits
Total credit available across various types:
- tot_hi_cred_lim
- total_rev_hi_lim
- total_il_high_credit_limit
- tot_cur_bal
6. Credit Account Status
These features indicate the number and types of open/active accounts:
- open_acc
- total_acc
- open_acc_6m
- open_act_il
- open_il_12m
- open_il_24m
- open_rv_12m
- open_rv_24m
- num_sats
- num_il_tl
- num_rev_accts
- num_actv_bc_tl
- num_actv_rev_tl
- num_bc_sats
- num_bc_tl
- num_op_rev_tl
- num_rev_tl_bal_gt_0
- acc_open_past_24mths
- total_cu_tl
7. Credit Inquiries
Indicators of recent credit-seeking activity:
- inq_fi
- inq_last_6mths
- inq_last_12m
- mths_since_recent_inq
8. Payment and Recovery
These reflect payments and recovery-related metrics:
- out_prncp
- out_prncp_inv
- total_pymnt
- total_pymnt_inv
- total_rec_prncp
- total_rec_int
- total_rec_late_fee
- recoveries
- collection_recovery_fee
- last_pymnt_amnt
9. FICO Scores
Borrower’s credit score ranges:
- fico_range_low
- fico_range_high
- last_fico_range_high
- last_fico_range_low
10. Credit Line & History Timelines
Tracks age or recency of credit lines:
- mths_since_earliest_cr_line
- mths_since_issue_d
- mo_sin_old_il_acct
- mo_sin_old_rev_tl_op
- mo_sin_rcnt_rev_tl_op
- mo_sin_rcnt_tl
- mths_since_rcnt_il
- mths_since_recent_bc
- mths_since_recent_bc_dlq
- mths_since_recent_revol_delinq
11. Other / Miscellaneous
- percent_bc_gt_75
- pct_tl_nvr_dlq
- tax_liens
- pub_rec
- pub_rec_bankruptcies
- tot_coll_amt
- mort_acc
- months_since_last_pymnt
- months_since_last_credit_pull
- policy_code
Function that calculates the Variance Inflation Factor (VIF) for a given list of numerical features:¶
from statsmodels.stats.outliers_influence import variance_inflation_factor
def calculate_vif(data, num_features, sample_size=10000, random_state=42):
"""
Calculates the Variance Inflation Factor (VIF) for the specified numerical features.
Parameters:
- data (pd.DataFrame): The full DataFrame.
- num_features (list): List of numerical features to include in the VIF calculation.
- sample_size (int): Number of rows to sample for computation.
- random_state (int): Seed for reproducibility.
Returns:
- pd.DataFrame: Sorted DataFrame with features and their VIF values.
"""
# Drop rows with missing values in the selected features
X = data[num_features].dropna()
# Sample the data to speed up VIF calculation
X_sample = X.sample(n=sample_size, random_state=random_state)
# Calculate VIF for each feature
vif_data = pd.DataFrame()
vif_data["feature"] = X_sample.columns
vif_data["VIF"] = [variance_inflation_factor(X_sample.values, i) for i in range(X_sample.shape[1])]
# Sort by VIF in descending order
vif_data = vif_data.sort_values(by='VIF', ascending=False).reset_index(drop=True)
return vif_data
VIF > 10 suggests serious multicollinearity, meaning the feature is highly predictable from other features and could distort model interpretation and stability.
VIF between 5–10 indicates moderate correlation.
VIF < 5 is generally considered acceptable.
Function version of the highly correlated feature removal process:¶
def drop_highly_correlated_features(data, threshold=0.95):
"""
Drops one of each pair of features with absolute correlation higher than the threshold.
Parameters:
- data (pd.DataFrame): Input DataFrame with numerical features.
- threshold (float): Correlation threshold for dropping features.
Returns:
- pd.DataFrame: DataFrame with reduced features.
- list: List of dropped features.
"""
# Compute absolute correlation matrix
corr_matrix = data.corr().abs()
# Take the upper triangle of the correlation matrix
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
# Find columns with any correlation higher than the threshold
to_drop = [col for col in upper.columns if any(upper[col] > threshold)]
# Drop those columns
reduced_data = data.drop(columns=to_drop)
return reduced_data, to_drop
1. Loan Information¶
# List of the Loan Information features
num_features = ['loan_amnt', 'funded_amnt', 'funded_amnt_inv', 'term_int', 'int_rate', 'installment']
# Calculate and print Variance Inflation Factor (VIF)
print(calculate_vif(df_inputs_prepr, num_features))
feature VIF 0 funded_amnt 10171.338802 1 funded_amnt_inv 5723.614303 2 loan_amnt 3901.089311 3 installment 68.475463 4 term_int 21.619893 5 int_rate 15.790025
Recommendations¶
1. Drop highly collinear variables:
Keep only one among:
- loan_amnt ✅ (commonly the most interpretable)
- funded_amnt ❌
- funded_amnt_inv ❌
Also consider dropping:
- installment ❌ (very collinear, can be recreated from loan amount + interest rate + term)
2. Keep these features:
- int_rate ✅ (Key driver of affordability and default)
- term_int ✅ (Can be binned into short vs long-term if needed)
VIF analysis of the new set of features.¶
# List of the Loan Information features
num_features = ['loan_amnt', 'term_int', 'int_rate']
# Calculate and print Variance Inflation Factor (VIF)
print(calculate_vif(df_inputs_prepr, num_features))
feature VIF 0 term_int 13.595521 1 int_rate 10.039828 2 loan_amnt 4.353243
The keeped variables are now moderate and low multicollinear.
2. Applicant Income & Employment¶
# List of the Loan Information features
num_features = ['annual_inc', 'emp_length_int', 'dti']
# Calculate and print Variance Inflation Factor (VIF)
print(calculate_vif(df_inputs_prepr, num_features))
feature VIF 0 emp_length_int 2.135642 1 dti 1.909319 2 annual_inc 1.369659
Recommendation¶
These features are generally informative and complementary, so we do not need to drop any of them.
3. Credit History and Delinquency¶
# List of the Loan Information features
num_features = ['delinq_2yrs', 'mths_since_last_delinq', 'mths_since_last_record', 'collections_12_mths_ex_med',
'mths_since_last_major_derog', 'chargeoff_within_12_mths', 'delinq_amnt', 'acc_now_delinq', 'num_accts_ever_120_pd',
'num_tl_90g_dpd_24m', 'num_tl_120dpd_2m', 'num_tl_30dpd']
# Calculate and print Variance Inflation Factor (VIF)
print(calculate_vif(df_inputs_prepr, num_features))
feature VIF 0 acc_now_delinq 6.438133 1 mths_since_last_major_derog 6.050945 2 num_tl_30dpd 5.342100 3 mths_since_last_record 4.121314 4 mths_since_last_delinq 3.568357 5 num_tl_120dpd_2m 2.516389 6 delinq_2yrs 2.242248 7 num_tl_90g_dpd_24m 2.021835 8 num_accts_ever_120_pd 1.450038 9 delinq_amnt 1.414542 10 chargeoff_within_12_mths 1.059052 11 collections_12_mths_ex_med 1.020881
Recommendations¶
1. Keep As-Is:
These features have low VIF, likely offer independent information, and are good predictors of risk:
- delinq_2yrs
- num_tl_90g_dpd_24m
- num_accts_ever_120_pd
- num_tl_120dpd_2m
- chargeoff_within_12_mths
- delinq_amnt
- collections_12_mths_ex_med
2. Watch for Moderate Multicollinearity:
These are "time since" delinquency features and may overlap:
- mths_since_last_major_derog
- mths_since_last_record
- mths_since_last_delinq
- acc_now_delinq
We can keep one or two of the most informative time-based ones.
Combine Features:
# If multiple time-based features overlap, consider taking the minimum or latest date:
# Taking the minimum of 'mths_since_last_delinq' and 'mths_since_last_major_derog'
df_inputs_prepr['min_mths_since_delinquency'] = df_inputs_prepr[['mths_since_last_delinq', 'mths_since_last_major_derog']].min(axis=1)
# 'acc_now_delinq' shows if the borrower is currently delinquent on any account.
# 'mths_since_last_record' indicates how recently a public derogatory record (e.g., bankruptcy, judgment) was filed.
# Recommended Approach: Feature Combination with Risk Buckets
# Create a flag for current delinquency:
df_inputs_prepr['has_delinquency_now'] = (df_inputs_prepr['acc_now_delinq'] > 0).astype(int)
# Bucketize mths_since_last_record
df_inputs_prepr['last_record_bucket'] = pd.cut(df_inputs_prepr['mths_since_last_record'], bins=[-1, 12, 24, 60, np.inf],
labels=['<1yr', '1-2yr', '2-5yr', '5+yr'])
# Combine into a single categorical feature
df_inputs_prepr['delinq_record_combo'] = df_inputs_prepr['has_delinquency_now'].astype(str) + '_' + df_inputs_prepr['last_record_bucket'].astype(str)
# Define risk levels manually = Transform 'delinq_record_combo' feature to numirical
risk_map = {
'1_<1yr': 7,
'1_1-2yr': 6,
'1_2-5yr': 5,
'1_5+yr': 4,
'0_<1yr': 3,
'0_1-2yr': 2,
'0_2-5yr': 1,
'0_5+yr': 0
}
df_inputs_prepr['delinq_record_risk_score'] = df_inputs_prepr['delinq_record_combo'].map(risk_map)
VIF analysis of the new set of features.¶
# List of the Loan Information features
num_features = ['min_mths_since_delinquency', 'delinq_record_risk_score', 'delinq_2yrs', 'collections_12_mths_ex_med',
'chargeoff_within_12_mths', 'delinq_amnt', 'num_accts_ever_120_pd',
'num_tl_90g_dpd_24m', 'num_tl_120dpd_2m', 'num_tl_30dpd']
# Calculate and print Variance Inflation Factor (VIF)
print(calculate_vif(df_inputs_prepr, num_features))
feature VIF 0 num_tl_90g_dpd_24m 1.846076 1 delinq_2yrs 1.813648 2 delinq_record_risk_score 1.583014 3 num_tl_120dpd_2m 1.526392 4 delinq_amnt 1.420302 5 num_tl_30dpd 1.411802 6 num_accts_ever_120_pd 1.166534 7 chargeoff_within_12_mths 1.056240 8 collections_12_mths_ex_med 1.016597 9 min_mths_since_delinquency 1.012894
The VIF values for all remaining delinquency-related features are well below the common multicollinearity threshold of 5, indicating that multicollinearity is no longer a concern in this subset. This suggests that each feature contributes uniquely to the model and provides distinct information about the borrower's credit risk. Therefore, no further feature removal is needed based on multicollinearity, and this set can be retained for modeling.
4. Credit Utilization & Balance¶
# List of the Loan Information features
num_features = ['revol_bal', 'total_bal_il', 'max_bal_bc', 'avg_cur_bal', 'revol_util', 'il_util',
'all_util', 'bc_util','bc_open_to_buy', 'total_bal_ex_mort', 'total_bc_limit']
# Calculate and print Variance Inflation Factor (VIF)
print(calculate_vif(df_inputs_prepr, num_features))
feature VIF 0 all_util 36.721871 1 il_util 28.903446 2 bc_util 19.463804 3 revol_util 19.446893 4 total_bc_limit 14.343601 5 bc_open_to_buy 9.326915 6 total_bal_ex_mort 5.450999 7 revol_bal 3.475286 8 total_bal_il 2.768232 9 max_bal_bc 2.050169 10 avg_cur_bal 1.905737
Recommendations:¶
We have a set of credit limit and balance-related features that are highly collinear. These are very common in credit scoring datasets and often show extreme multicollinearity because they represent variations of the same financial behavior.
Multicollinearity Issue: VIF > 10 means serious multicollinearity. We can not include all of them directly in a model — they will distort the coefficients and inflate standard errors.
1. Total Credit Utilization (Core Features)
These can be engineered into ratios, which provide more meaningful insights.
# Utilization Ratios
df_inputs_prepr['revol_bal_to_bc_limit'] = df_inputs_prepr['revol_bal'] / df_inputs_prepr['total_bc_limit'].replace(0, np.nan)
df_inputs_prepr['revol_bal_to_open_to_buy'] = df_inputs_prepr['revol_bal'] / df_inputs_prepr['bc_open_to_buy'].replace(0, np.nan)
# Balance to Income (if annual_inc is available):
df_inputs_prepr['total_bal_ex_mort_to_inc'] = df_inputs_prepr['total_bal_ex_mort'] / df_inputs_prepr['annual_inc'].replace(0, np.nan)
2. Drop Redundant Variables
Choose only one or two from each highly correlated group:
- From all_util, il_util, revol_util, bc_util: We keep revol_util (widely used in credit models) and we drop the rest.
- From total_bc_limit, bc_open_to_buy: We keep bc_open_to_buy (more dynamic).
Then we drop the original raw features used in these ratios and transformed features.
VIF analysis of the new set of features.¶
# List of the Loan Information features
num_features = ['revol_bal', 'total_bal_il', 'max_bal_bc', 'avg_cur_bal', 'bc_open_to_buy', 'revol_bal_to_bc_limit',
'revol_bal_to_open_to_buy', 'total_bal_ex_mort_to_inc']
# Calculate and print Variance Inflation Factor (VIF)
print(calculate_vif(df_inputs_prepr, num_features))
feature VIF 0 revol_bal 2.174330 1 total_bal_ex_mort_to_inc 2.077000 2 total_bal_il 1.819026 3 max_bal_bc 1.768437 4 avg_cur_bal 1.660353 5 bc_open_to_buy 1.414334 6 revol_bal_to_bc_limit 1.245156 7 revol_bal_to_open_to_buy 1.031061
The VIF values for all remaining delinquency-related features are well below the common multicollinearity threshold of 5, indicating that multicollinearity is no longer a concern in this subset. Therefore, no further feature removal is needed based on multicollinearity, and this set can be retained for modeling.
5. Credit Limits¶
# List of the Loan Information features.
num_features = ['tot_hi_cred_lim', 'total_rev_hi_lim', 'total_il_high_credit_limit', 'tot_cur_bal']
# Calculate and print Variance Inflation Factor (VIF).
print(calculate_vif(df_inputs_prepr, num_features))
feature VIF 0 tot_hi_cred_lim 86.048912 1 tot_cur_bal 70.228960 2 total_rev_hi_lim 3.281832 3 total_il_high_credit_limit 1.988739
Recommendations¶
1. combine 'tot_hi_cred_lim' and 'tot_cur_bal' into a credit utilization-like ratio:
Captures how much of the credit limit is being used overall.
df_inputs_prepr['total_balance_to_credit_ratio'] = df_inputs_prepr['tot_cur_bal'] / df_inputs_prepr['tot_hi_cred_lim'].replace(0, np.nan)
2. rev_to_il_limit_ratio = Installment vs. Revolving Exposure:
Gives insight into the borrower’s credit type distribution (revolving vs installment).
df_inputs_prepr['rev_to_il_limit_ratio'] = df_inputs_prepr['total_rev_hi_lim'] / df_inputs_prepr['total_il_high_credit_limit'].replace(0, np.nan)
3. Keep the feature 'total_rev_hi_lim' as is: The borrower’s available revolving credit limit — important for understanding credit card capacity.
VIF analysis of the new set of features.¶
# List of the Loan Information features.
num_features = ['total_balance_to_credit_ratio', 'rev_to_il_limit_ratio', 'total_il_high_credit_limit', 'tot_cur_bal']
# Calculate and print Variance Inflation Factor (VIF).
print(calculate_vif(df_inputs_prepr, num_features))
feature VIF 0 total_balance_to_credit_ratio 3.732316 1 total_il_high_credit_limit 3.040626 2 tot_cur_bal 2.632886 3 rev_to_il_limit_ratio 1.321904
The VIF values for all remaining engineered features are well below the threshold of 5, indicating low multicollinearity and suggesting that each variable provides unique and valuable information for modeling. This confirms the effectiveness of the feature engineering process in reducing redundancy while preserving predictive power.
6. Account and Credit Line Counts¶
# List of the Loan Information features
num_features = ['open_acc', 'total_acc', 'open_acc_6m', 'open_act_il', 'open_il_12m', 'open_il_24m', 'open_rv_12m',
'open_rv_24m', 'num_sats', 'num_il_tl', 'num_rev_accts', 'num_actv_bc_tl', 'num_actv_rev_tl', 'num_bc_sats',
'num_bc_tl', 'num_op_rev_tl', 'num_rev_tl_bal_gt_0', 'acc_open_past_24mths', 'total_cu_tl']
# Calculate and print Variance Inflation Factor (VIF)
print(calculate_vif(df_inputs_prepr, num_features))
feature VIF 0 num_rev_tl_bal_gt_0 135.413933 1 num_actv_rev_tl 125.544110 2 num_sats 121.894356 3 open_acc 121.812303 4 total_acc 81.870765 5 num_rev_accts 75.563319 6 num_op_rev_tl 52.440646 7 num_bc_tl 29.505863 8 num_actv_bc_tl 26.112984 9 num_bc_sats 24.070302 10 num_il_tl 15.243944 11 acc_open_past_24mths 6.171402 12 open_rv_12m 6.050756 13 open_rv_24m 5.908277 14 open_il_24m 5.383007 15 open_il_12m 4.179422 16 open_acc_6m 3.607243 17 open_act_il 3.128423 18 total_cu_tl 1.451131
The VIF results show very high multicollinearity among many account number/count features, especially for revolving, bankcard, and installment accounts.
1. Recommended Features to KEEP (low redundancy, strong representation):¶
- open_acc: Broad indicator of currently open accounts; relatively interpretable.
- num_rev_tl_bal_gt_0: Reflects active revolving accounts with a balance — important for risk.
- num_il_tl: Captures installment loan history; low VIF and unique info.
- acc_open_past_24mths: Proxy for recent credit-seeking behavior; moderately correlated.
- total_cu_tl: Specific to credit union trades; unique dimension of credit profile.
2. Recommended Features to DROP (high VIF or redundant info):¶
- total_acc: Highly collinear with open_acc and num_sats.
- open_acc_6m: Overlaps with acc_open_past_24mths and others capturing recent openings.
- open_act_il: Correlates with num_il_tl, brings little new info.
- open_il_12m, open_il_24m: Temporal splits of installment openings — redundant with num_il_tl.
- open_rv_12m, open_rv_24m: Same issue with revolving trades, overlaps with num_rev_tl_bal_gt_0.
- num_sats: Nearly identical meaning to open_acc.
- num_rev_accts: Overlaps heavily with num_op_rev_tl and num_actv_rev_tl.
- num_actv_bc_tl, num_actv_rev_tl, num_bc_sats, num_bc_tl: Redundant with num_rev_tl_bal_gt_0.
- num_op_rev_tl: Captured by broader num_rev_tl_bal_gt_0.
VIF analysis of the new set of features.¶
# List of the Loan Information features
num_features = ['total_acc', 'open_act_il', 'open_il_12m', 'num_actv_rev_tl', 'open_rv_12m', 'num_bc_tl',
'open_acc_6m', 'acc_open_past_24mths', 'total_cu_tl']
# Calculate and print Variance Inflation Factor (VIF)
print(calculate_vif(df_inputs_prepr, num_features))
feature VIF 0 total_acc 7.873322 1 num_bc_tl 6.650931 2 num_actv_rev_tl 4.946451 3 acc_open_past_24mths 4.740237 4 open_acc_6m 3.602185 5 open_rv_12m 3.108141 6 open_il_12m 2.296064 7 open_act_il 1.777657 8 total_cu_tl 1.331291
The VIF values for all remaining engineered features are well below the threshold of 5, indicating low multicollinearity and suggesting that each variable provides unique and valuable information for modeling. This confirms the effectiveness of the feature engineering process in reducing redundancy while preserving predictive power.
7. Credit Inquiries¶
# List of the Loan Information features
num_features = ['inq_fi', 'inq_last_6mths', 'inq_last_12m', 'mths_since_recent_inq']
# Calculate and print Variance Inflation Factor (VIF)
print(calculate_vif(df_inputs_prepr, num_features))
feature VIF 0 inq_last_12m 2.346682 1 inq_fi 2.126177 2 inq_last_6mths 1.209716 3 mths_since_recent_inq 1.006674
1. Keep All Features¶
Each feature captures slightly different information:
- Frequency (how many inquiries: inq_last_12m, inq_last_6mths, inq_fi)
- Recency (how recent: mths_since_recent_inq)
Credit risk models care both about how many inquiries you had (volume) and how recently (recency).
8. Payment and Recovery¶
# List of the features you want to check
num_features = ['out_prncp', 'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int',
'total_rec_late_fee', 'recoveries', 'collection_recovery_fee', 'last_pymnt_amnt']
# Calculate and print Variance Inflation Factor (VIF)
print(calculate_vif(df_inputs_prepr, num_features))
feature VIF 0 total_pymnt 6.526956e+13 1 total_rec_prncp 4.503600e+13 2 total_rec_int 2.609270e+12 3 recoveries 1.796159e+11 4 total_rec_late_fee 2.310876e+07 5 out_prncp 1.035721e+06 6 out_prncp_inv 1.035720e+06 7 total_pymnt_inv 5.605819e+03 8 collection_recovery_fee 1.083006e+01 9 last_pymnt_amnt 3.216300e+00
There is massive multicollinearity. These features are all highly dependent on each other, because they are accounting/cash flow features related to loan repayment.
1. Drop Highly Redundant Features¶
Keep only 1 or 2 summary features instead of everything.
- Suggested to KEEP:
- last_pymnt_amnt ➔ amount of last payment (dynamic signal).
- out_prncp ➔ current outstanding principal balance.
- Suggested to DROP or not prioritize:
- total_pymnt, total_rec_prncp, total_rec_int, recoveries, total_rec_late_fee, collection_recovery_fee, total_pymnt_inv, out_prncp_inv ➔ all highly redundant and overlapping.
2. Create Aggregated Features (Optional)¶
We can keep more information without multicollinearity:
- Principal paid ratio: Proportion of principal repaid (good for default modeling).
df_inputs_prepr['principal_paid_ratio'] = df_inputs_prepr['total_rec_prncp'] / df_inputs_prepr['loan_amnt']
VIF analysis of the new set of features.¶
# List of the features you want to check
num_features = ['out_prncp', 'last_pymnt_amnt', 'principal_paid_ratio']
# Calculate and print Variance Inflation Factor (VIF)
print(calculate_vif(df_inputs_prepr, num_features))
feature VIF 0 principal_paid_ratio 1.739074 1 last_pymnt_amnt 1.737794 2 out_prncp 1.000991
The VIF values for all remaining engineered features are well below the threshold of 5, indicating low multicollinearity and suggesting that each variable provides unique and valuable information for modeling. This confirms the effectiveness of the feature engineering process in reducing redundancy while preserving predictive power.
9. FICO Scores¶
# List of the features you want to check
num_features = ['fico_range_low', 'fico_range_high', 'last_fico_range_high', 'last_fico_range_low']
# Calculate and print Variance Inflation Factor (VIF)
print(calculate_vif(df_inputs_prepr, num_features))
feature VIF 0 fico_range_high 1.505941e+07 1 fico_range_low 1.505512e+07 2 last_fico_range_high 2.315564e+02 3 last_fico_range_low 7.445950e+01
- Including both low and high versions of FICO scores causes massive redundancy.
- The model would have unstable coefficients, difficulty interpreting feature importance, and inflated errors.
1. Drop One of Each Pair¶
We do not need both low and high scores — they are highly correlated.
- Keep only one from the original FICO range (fico_range_high or fico_range_low).
- Keep only one from the last FICO range (last_fico_range_high or last_fico_range_low).
10. Credit Line & History Timelines¶
# List of the features you want to check
num_features = ['mths_since_earliest_cr_line', 'mths_since_issue_d', 'mo_sin_old_il_acct', 'mo_sin_old_rev_tl_op', 'mo_sin_rcnt_rev_tl_op',
'mo_sin_rcnt_tl', 'mths_since_rcnt_il', 'mths_since_recent_bc', 'mths_since_recent_bc_dlq', 'mths_since_recent_revol_delinq']
# Calculate and print Variance Inflation Factor (VIF)
print(calculate_vif(df_inputs_prepr, num_features))
feature VIF 0 mths_since_earliest_cr_line 52.837152 1 mo_sin_old_rev_tl_op 25.293201 2 mths_since_issue_d 18.405433 3 mths_since_recent_bc_dlq 11.219079 4 mths_since_recent_revol_delinq 8.458308 5 mo_sin_old_il_acct 7.728318 6 mths_since_rcnt_il 5.044650 7 mo_sin_rcnt_rev_tl_op 3.561668 8 mo_sin_rcnt_tl 2.924126 9 mths_since_recent_bc 2.582552
Main Observations:¶
- The top 3 (mths_since_earliest_cr_line, mo_sin_old_rev_tl_op, mths_since_issue_d) have very high VIFs (>10) → strong multicollinearity.
- The others (bottom 7) have moderate multicollinearity (VIF ~2.5–5).
Why?
- Many of these are time features measuring similar things: age of accounts, recency of new accounts, recency of delinquencies, etc.
- Naturally, older credit history → older revolving accounts, older installment accounts, etc.
- Loan issue date is also highly tied to the borrower's credit age profile.
- Redundancy between "oldest" and "most recent" time features.
- Models can become unstable and overfit due to high correlation between these time measures.
Recommendations for Feature Engineering:¶
1. Drop Some Redundant "Oldest" Features
- These three features are very correlated: mths_since_earliest_cr_line, mo_sin_old_rev_tl_op, mo_sin_old_il_acct
- Suggestion:
- Keep only mths_since_earliest_cr_line (captures overall credit age).
- Drop mo_sin_old_rev_tl_op and mo_sin_old_il_acct.
2. Handle Loan Issue Date Carefully
- 'mths_since_issue_d': reflects the age of the loan. It might be useful, but it's strongly collinear with other "months since" features.
- Suggestion: drop it to reduce multicollinearity.
3. Keep Recency of Recent Activity These are different kinds of recency indicators:
- mths_since_recent_bc_dlq
- mths_since_recent_revol_delinq
- mths_since_rcnt_il
- mo_sin_rcnt_rev_tl_op
- mo_sin_rcnt_tl
- mths_since_recent_bc
Suggestion:
- Keep most of these for now — they capture recent delinquency, recent account opening, and recency of activity, all of which are important for credit risk.
- Later you can cluster or combine them if needed (e.g., minimum of all recent months as a "most recent event" feature).
VIF analysis of the new set of features.¶
# List of the features you want to check
num_features = ['mths_since_earliest_cr_line', 'mo_sin_rcnt_rev_tl_op', 'mo_sin_rcnt_tl', 'mths_since_rcnt_il',
'mths_since_recent_bc', 'mths_since_recent_revol_delinq']
# Calculate and print Variance Inflation Factor (VIF)
print(calculate_vif(df_inputs_prepr, num_features))
feature VIF 0 mths_since_earliest_cr_line 4.032080 1 mo_sin_rcnt_rev_tl_op 3.555260 2 mo_sin_rcnt_tl 2.902862 3 mths_since_rcnt_il 2.572720 4 mths_since_recent_bc 2.571783 5 mths_since_recent_revol_delinq 2.474972
The VIF values for all remaining engineered features are well below the threshold of 5, indicating low multicollinearity and suggesting that each variable provides unique and valuable information for modeling. This confirms the effectiveness of the feature engineering process in reducing redundancy while preserving predictive power.
11. Other / Miscellaneous¶
# List of the features you want to check
num_features = ['percent_bc_gt_75', 'pct_tl_nvr_dlq', 'tax_liens', 'pub_rec', 'pub_rec_bankruptcies', 'tot_coll_amt',
'mort_acc', 'months_since_last_pymnt', 'months_since_last_credit_pull', 'policy_code']
# Calculate and print Variance Inflation Factor (VIF)
print(calculate_vif(df_inputs_prepr, num_features))
feature VIF 0 policy_code 127.367693 1 pub_rec 7.787948 2 pub_rec_bankruptcies 4.551261 3 tax_liens 3.970360 4 months_since_last_pymnt 1.463992 5 months_since_last_credit_pull 1.453885 6 pct_tl_nvr_dlq 1.026930 7 tot_coll_amt 1.008896 8 mort_acc 1.005815 9 percent_bc_gt_75 1.004729
1. Keep Timing and Ratio Variables As-Is¶
pub_rec_bankruptcies, months_since_last_pymnt, months_since_last_credit_pull, pct_tl_nvr_dlq, tot_coll_amt, percent_bc_gt_75 and mort_acc are good and low-collinearity.
2. Assess Relationship Between Public Record Variables¶
- pub_rec, pub_rec_bankruptcies, tax_liens are all related to legal financial problems.
- Suggestion:
- Keep pub_rec_bankruptcies separately — bankruptcies have a big credit impact.
- Consider combining pub_rec and tax_liens to Create a single indicator and then drop them individually.
# Combining 'pub_rec' and 'tax_liens'
df_inputs_prepr['total_public_records'] = df_inputs_prepr['pub_rec'] + df_inputs_prepr['tax_liens']
3. Drop policy_code¶
It usually has only 1 unique value (1) in the Lending Club dataset ➔ Drop it immediately — adds no predictive value.
VIF analysis of the new set of features.¶
# List of the features you want to check
num_features = ['percent_bc_gt_75', 'pub_rec_bankruptcies', 'tot_coll_amt', 'mort_acc',
'months_since_last_credit_pull', 'total_public_records']
# Calculate and print Variance Inflation Factor (VIF)
print(calculate_vif(df_inputs_prepr, num_features))
feature VIF 0 months_since_last_credit_pull 2.443156 1 percent_bc_gt_75 2.144920 2 mort_acc 1.512708 3 pub_rec_bankruptcies 1.476033 4 total_public_records 1.444479 5 tot_coll_amt 1.010763
The VIF values for all remaining engineered features are well below the threshold of 5, indicating low multicollinearity and suggesting that each variable provides unique and valuable information for modeling. This confirms the effectiveness of the feature engineering process in reducing redundancy while preserving predictive power.
12. Feature Reduction Based on Multicollinearity¶
To enhance model stability and reduce redundancy, a Variance Inflation Factor (VIF) analysis was conducted on all numerical features. High VIF values indicate multicollinearity, which can distort model coefficients and impair generalization. Based on this analysis, 54 numerical features exhibiting strong multicollinearity (VIF > threshold) were identified and excluded from further modeling to retain only the most informative and independent predictors.
# List of features to drop.
feats_num_to_drop = ['funded_amnt', 'funded_amnt_inv', 'installment', 'mths_since_last_delinq', 'mths_since_last_record',
'mths_since_last_major_derog', 'acc_now_delinq', 'has_delinquency_now', 'last_record_bucket',
'delinq_record_combo', 'all_util', 'il_util', 'bc_util', 'revol_util', 'total_bc_limit',
'total_bal_ex_mort', 'tot_hi_cred_lim', 'open_acc', 'open_il_24m', 'open_rv_24m',
'num_sats', 'num_il_tl', 'num_rev_accts', 'num_actv_bc_tl', 'num_bc_sats',
'num_op_rev_tl', 'num_rev_tl_bal_gt_0', 'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv',
'total_rec_prncp', 'total_rec_int', 'total_rec_late_fee', 'recoveries', 'collection_recovery_fee',
'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int',
'total_rec_late_fee', 'recoveries', 'collection_recovery_fee', 'fico_range_low', 'last_fico_range_low',
'mo_sin_old_rev_tl_op', 'mo_sin_old_il_acct', 'mths_since_issue_d', 'mths_since_recent_bc_dlq',
'policy_code', 'pub_rec', 'tax_liens', 'months_since_last_pymnt', 'pct_tl_nvr_dlq']
print(len(feats_num_to_drop))
54
# Drop this set of features from the df_inputs_prepr dataframe.
df_inputs_prepr = df_inputs_prepr.drop(columns = feats_num_to_drop)
C. Engineering of Numerical Variables¶
Checking the list and the number of features after these preprocessing:¶
List_num_features = ['loan_amnt', 'term_int', 'int_rate', 'annual_inc', 'emp_length_int', 'dti', 'min_mths_since_delinquency',
'delinq_record_risk_score', 'delinq_2yrs', 'collections_12_mths_ex_med', 'chargeoff_within_12_mths',
'delinq_amnt', 'num_accts_ever_120_pd', 'num_tl_90g_dpd_24m', 'num_tl_120dpd_2m', 'num_tl_30dpd',
'revol_bal', 'total_bal_il', 'max_bal_bc', 'avg_cur_bal', 'bc_open_to_buy', 'revol_bal_to_bc_limit',
'revol_bal_to_open_to_buy', 'total_bal_ex_mort_to_inc', 'total_balance_to_credit_ratio', 'rev_to_il_limit_ratio',
'total_il_high_credit_limit', 'tot_cur_bal', 'total_acc', 'open_act_il', 'open_il_12m', 'num_actv_rev_tl',
'open_rv_12m', 'num_bc_tl', 'open_acc_6m', 'acc_open_past_24mths', 'total_cu_tl', 'inq_fi', 'inq_last_6mths',
'inq_last_12m', 'mths_since_recent_inq', 'out_prncp', 'last_pymnt_amnt', 'principal_paid_ratio',
'fico_range_high', 'last_fico_range_high', 'mths_since_earliest_cr_line', 'mo_sin_rcnt_rev_tl_op', 'mo_sin_rcnt_tl',
'mths_since_rcnt_il', 'mths_since_recent_bc', 'mths_since_recent_revol_delinq', 'percent_bc_gt_75',
'pub_rec_bankruptcies', 'tot_coll_amt', 'mort_acc', 'months_since_last_credit_pull', 'total_public_records']
print('number of features after preprocessing: ', len(List_num_features))
number of features after preprocessing: 58
We obtain now 58 features after the preprocessing of numerical variables instead of the original number of 94.
Classification of the numerical features into discrete or continuous:¶
Here's a "smart" automatic strategy for classifying features into discrete or continuous, not just based on n_unique, but combining:
- Number of unique values.
- Variance (dispersion).
- Data type (integer vs float).
Smart Strategy Logic:¶
1. If the feature is integer type:
- If unique values ≤ 15 → treat as discrete.
- Else → if variance is low → treat as discrete, otherwise continuous.
2. If the feature is float type:
- Always treat as continuous (except if very few unique values, like ≤ 5).
3. If the feature has extremely low variance (almost constant), treat it as discrete.
# Function to classify the features
def classify_feature(df, threshold_unique=15, threshold_variance=0.01):
discrete_features = []
continuous_features = []
for col in df.columns:
if np.issubdtype(df[col].dtype, np.number): # only numeric features
n_unique = df[col].nunique()
variance = df[col].var()
if np.issubdtype(df[col].dtype, np.integer):
if n_unique <= threshold_unique or variance < threshold_variance:
discrete_features.append(col)
else:
continuous_features.append(col)
else: # float
if n_unique <= 5 or variance < threshold_variance:
discrete_features.append(col)
else:
continuous_features.append(col)
return discrete_features, continuous_features
Why is this better?¶
- It adapts to your real data — not just the number of unique values blindly.
- It respects the nature of "counts" vs "proportions" vs "scores".
- It avoids misclassifying slightly continuous features with few categories.
Classification of the numerical features into discrete or continous¶
# Assuming df_inputs_prepr is your preprocessed dataset
df_inputs_prepr_class = df_inputs_prepr[List_num_features].copy()
discr_features, conti_features = classify_feature(df_inputs_prepr_class)
print("Discrete features:", discr_features)
print()
print("Continuous features:", conti_features)
Discrete features: ['term_int', 'delinq_record_risk_score', 'num_tl_120dpd_2m', 'num_tl_30dpd'] Continuous features: ['loan_amnt', 'int_rate', 'annual_inc', 'emp_length_int', 'dti', 'min_mths_since_delinquency', 'delinq_2yrs', 'collections_12_mths_ex_med', 'chargeoff_within_12_mths', 'delinq_amnt', 'num_accts_ever_120_pd', 'num_tl_90g_dpd_24m', 'revol_bal', 'total_bal_il', 'max_bal_bc', 'avg_cur_bal', 'bc_open_to_buy', 'revol_bal_to_bc_limit', 'revol_bal_to_open_to_buy', 'total_bal_ex_mort_to_inc', 'total_balance_to_credit_ratio', 'rev_to_il_limit_ratio', 'total_il_high_credit_limit', 'tot_cur_bal', 'total_acc', 'open_act_il', 'open_il_12m', 'num_actv_rev_tl', 'open_rv_12m', 'num_bc_tl', 'open_acc_6m', 'acc_open_past_24mths', 'total_cu_tl', 'inq_fi', 'inq_last_6mths', 'inq_last_12m', 'mths_since_recent_inq', 'out_prncp', 'last_pymnt_amnt', 'principal_paid_ratio', 'fico_range_high', 'last_fico_range_high', 'mths_since_earliest_cr_line', 'mo_sin_rcnt_rev_tl_op', 'mo_sin_rcnt_tl', 'mths_since_rcnt_il', 'mths_since_recent_bc', 'mths_since_recent_revol_delinq', 'percent_bc_gt_75', 'pub_rec_bankruptcies', 'tot_coll_amt', 'mort_acc', 'months_since_last_credit_pull', 'total_public_records']
With this classification strategy we found:
- 4 discrete features.
- 54 continous features.
WoE and IV classification of features:¶
- We apply Weight of Evidence (WoE) transformation to numerical features to create a stronger, more interpretable relationship with the target.
- Information Value (IV) helps prioritize the most predictive variables for credit risk modeling.
Function to evaluate WoE and IV of a continuous variable¶
# WoE function for ordered discrete and continuous variables
def woe_ordered_continuous(df, discrete_variabe_name, good_bad_variable_df):
df = pd.concat([df[discrete_variabe_name], good_bad_variable_df], axis = 1)
df = pd.concat([df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].count(),
df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].mean()], axis = 1)
df = df.iloc[:, [0, 1, 3]]
df.columns = [df.columns.values[0], 'n_obs', 'prop_good']
df['prop_n_obs'] = df['n_obs'] / df['n_obs'].sum()
df['n_good'] = df['prop_good'] * df['n_obs']
df['n_bad'] = (1 - df['prop_good']) * df['n_obs']
df['prop_n_good'] = df['n_good'] / df['n_good'].sum()
df['prop_n_bad'] = df['n_bad'] / df['n_bad'].sum()
df['WoE'] = np.log1p(df['prop_n_good'] / df['prop_n_bad'])
#df = df.sort_values(['WoE'])
#df = df.reset_index(drop = True)
df['diff_prop_good'] = df['prop_good'].diff().abs()
df['diff_WoE'] = df['WoE'].diff().abs()
df['IV'] = (df['prop_n_good'] - df['prop_n_bad']) * df['WoE']
df['IV'] = df['IV'].sum()
return df
# Here we define a function similar to the one above, ...
# ... with one slight difference: we order the results by the values of a different column.
# The function takes 3 arguments: a dataframe, a string, and a dataframe. The function returns a dataframe as a result.
Variable: 'term_int'¶
df_temp = woe_ordered_continuous(df_inputs_prepr, 'term_int', df_targets_prepr)
# We calculate weight of evidence.
df_temp
| term_int | n_obs | prop_good | prop_n_obs | n_good | n_bad | prop_n_good | prop_n_bad | WoE | diff_prop_good | diff_WoE | IV | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 36.0 | 206971 | 0.171783 | 0.754724 | 35554.0 | 171417.0 | 0.604639 | 0.79569 | 0.565253 | NaN | NaN | 0.09772 |
| 1 | 60.0 | 67263 | 0.345628 | 0.245276 | 23248.0 | 44015.0 | 0.395361 | 0.20431 | 1.076741 | 0.173846 | 0.511488 | 0.09772 |
plot_by_woe(df_temp)
# We plot the weight of evidence values.
# Leave as is.
# '60' will be the reference category.
df_inputs_prepr['term:36'] = np.where((df_inputs_prepr['term_int'] == 36), 1, 0)
df_inputs_prepr['term:60'] = np.where((df_inputs_prepr['term_int'] == 60), 1, 0)
Variable: 'num_tl_120dpd_2m'¶
# 'num_tl_120dpd_2m'
df_inputs_prepr['num_tl_120dpd_2m'].unique()
# Has only 6 levels: from 0 to 6. Hence, we turn it into a factor with 6 levels.
array([0., 1., 2.])
df_temp = woe_ordered_continuous(df_inputs_prepr, 'num_tl_120dpd_2m', df_targets_prepr)
# We calculate weight of evidence.
df_temp
| num_tl_120dpd_2m | n_obs | prop_good | prop_n_obs | n_good | n_bad | prop_n_good | prop_n_bad | WoE | diff_prop_good | diff_WoE | IV | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 274036 | 0.214435 | 0.999278 | 58763.0 | 215273.0 | 0.999337 | 0.999262 | 0.693185 | NaN | NaN | 0.000024 |
| 1 | 1.0 | 191 | 0.204188 | 0.000696 | 39.0 | 152.0 | 0.000663 | 0.000706 | 0.662701 | 0.010247 | 0.030484 | 0.000024 |
| 2 | 2.0 | 7 | 0.000000 | 0.000026 | 0.0 | 7.0 | 0.000000 | 0.000032 | 0.000000 | 0.204188 | 0.662701 | 0.000024 |
plot_by_woe(df_temp)
# We plot the weight of evidence values.
# We create the following categories: '0', '1', '2 - 6'
# '2-6' will be the reference category
df_inputs_prepr['num_tl_120dpd_2m:0'] = np.where(df_inputs_prepr['num_tl_120dpd_2m'].isin([0]), 1, 0)
df_inputs_prepr['num_tl_120dpd_2m:1'] = np.where(df_inputs_prepr['num_tl_120dpd_2m'].isin([1]), 1, 0)
df_inputs_prepr['num_tl_120dpd_2m:2-6'] = np.where(df_inputs_prepr['num_tl_120dpd_2m'].isin(range(2, 7)), 1, 0)
Variable: 'num_tl_30dpd'¶
# 'num_tl_30dpd'
df_inputs_prepr['num_tl_30dpd'].unique()
# Has only 6 levels: from 0 to 6. Hence, we turn it into a factor with 6 levels.
array([0., 2., 1., 3.])
df_temp = woe_ordered_continuous(df_inputs_prepr, 'num_tl_30dpd', df_targets_prepr)
# We calculate weight of evidence.
df_temp
| num_tl_30dpd | n_obs | prop_good | prop_n_obs | n_good | n_bad | prop_n_good | prop_n_bad | WoE | diff_prop_good | diff_WoE | IV | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 273429 | 0.214330 | 0.997065 | 58604.0 | 214825.0 | 0.996633 | 0.997182 | 0.692872 | NaN | NaN | 0.000058 |
| 1 | 1.0 | 761 | 0.248357 | 0.002775 | 189.0 | 572.0 | 0.003214 | 0.002655 | 0.793243 | 0.034028 | 0.100371 | 0.000058 |
| 2 | 2.0 | 37 | 0.216216 | 0.000135 | 8.0 | 29.0 | 0.000136 | 0.000135 | 0.698469 | 0.032141 | 0.094774 | 0.000058 |
| 3 | 3.0 | 7 | 0.142857 | 0.000026 | 1.0 | 6.0 | 0.000017 | 0.000028 | 0.476616 | 0.073359 | 0.221853 | 0.000058 |
plot_by_woe(df_temp)
# We plot the weight of evidence values.
# We create the following categories: '0 - 2', '3', '4'
# '4' will be the reference category
df_inputs_prepr['num_tl_30dpd:0'] = np.where(df_inputs_prepr['num_tl_30dpd'].isin([0]), 1, 0)
df_inputs_prepr['num_tl_30dpd:1'] = np.where(df_inputs_prepr['num_tl_30dpd'].isin([1]), 1, 0)
df_inputs_prepr['num_tl_30dpd:2-4'] = np.where(df_inputs_prepr['num_tl_30dpd'].isin(range(2, 5)), 1, 0)
Variable: 'delinq_record_risk_score'¶
# 'num_tl_30dpd'
df_inputs_prepr['delinq_record_risk_score'].unique()
# Has only 6 levels: from 0 to 6. Hence, we turn it into a factor with 6 levels.
array([0, 2, 1, 4, 3, 5, 7, 6], dtype=int64)
df_temp = woe_ordered_continuous(df_inputs_prepr, 'delinq_record_risk_score', df_targets_prepr)
# We calculate weight of evidence.
df_temp
| delinq_record_risk_score | n_obs | prop_good | prop_n_obs | n_good | n_bad | prop_n_good | prop_n_bad | WoE | diff_prop_good | diff_WoE | IV | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 256999 | 0.213304 | 0.937152 | 54819.0 | 202180.0 | 0.932264 | 0.938486 | 0.689827 | NaN | NaN | 0.00046 |
| 1 | 1 | 13401 | 0.230505 | 0.048867 | 3089.0 | 10312.0 | 0.052532 | 0.047867 | 0.740732 | 0.017201 | 0.050906 | 0.00046 |
| 2 | 2 | 1588 | 0.229849 | 0.005791 | 365.0 | 1223.0 | 0.006207 | 0.005677 | 0.738796 | 0.000656 | 0.001936 | 0.00046 |
| 3 | 3 | 987 | 0.245187 | 0.003599 | 242.0 | 745.0 | 0.004116 | 0.003458 | 0.783939 | 0.015339 | 0.045143 | 0.00046 |
| 4 | 4 | 1166 | 0.237564 | 0.004252 | 277.0 | 889.0 | 0.004711 | 0.004127 | 0.761531 | 0.007623 | 0.022408 | 0.00046 |
| 5 | 5 | 69 | 0.086957 | 0.000252 | 6.0 | 63.0 | 0.000102 | 0.000292 | 0.299306 | 0.150608 | 0.462225 | 0.00046 |
| 6 | 6 | 13 | 0.076923 | 0.000047 | 1.0 | 12.0 | 0.000017 | 0.000056 | 0.266438 | 0.010033 | 0.032868 | 0.00046 |
| 7 | 7 | 11 | 0.272727 | 0.000040 | 3.0 | 8.0 | 0.000051 | 0.000037 | 0.864527 | 0.195804 | 0.598088 | 0.00046 |
plot_by_woe(df_temp)
# We plot the weight of evidence values.
# We create the following categories: '0', '1 - 2', '3 - 4', '5 - 7'
# '5-7' will be the reference category
df_inputs_prepr['delinq_record_risk_score:0'] = np.where(df_inputs_prepr['delinq_record_risk_score'].isin([0]), 1, 0)
df_inputs_prepr['delinq_record_risk_score:1-2'] = np.where(df_inputs_prepr['delinq_record_risk_score'].isin(range(1, 3)), 1, 0)
df_inputs_prepr['delinq_record_risk_score:3-4'] = np.where(df_inputs_prepr['delinq_record_risk_score'].isin(range(3, 5)), 1, 0)
df_inputs_prepr['delinq_record_risk_score:5-7'] = np.where(df_inputs_prepr['delinq_record_risk_score'].isin(range(5, 8)), 1, 0)
Variable: 'loan_amnt'¶
# loan_amnt
df_inputs_prepr['loan_amnt_factor'] = pd.cut(df_inputs_prepr['loan_amnt'], 50)
# Here we do fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.
df_temp = woe_ordered_continuous(df_inputs_prepr, 'loan_amnt_factor', df_targets_prepr)
# We calculate weight of evidence.
df_temp
C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1025164655.py:4: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning. df = pd.concat([df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].count(), C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1025164655.py:5: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning. df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].mean()], axis = 1)
| loan_amnt_factor | n_obs | prop_good | prop_n_obs | n_good | n_bad | prop_n_good | prop_n_bad | WoE | diff_prop_good | diff_WoE | IV | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | (460.5, 1290.0] | 1742 | 0.128588 | 0.006352 | 224.0 | 1518.0 | 0.003809 | 0.007046 | 0.432187 | NaN | NaN | 0.022676 |
| 1 | (1290.0, 2080.0] | 3896 | 0.154517 | 0.014207 | 602.0 | 3294.0 | 0.010238 | 0.015290 | 0.512562 | 0.025930 | 0.080375 | 0.022676 |
| 2 | (2080.0, 2870.0] | 3583 | 0.149595 | 0.013065 | 536.0 | 3047.0 | 0.009115 | 0.014144 | 0.497425 | 0.004922 | 0.015136 | 0.022676 |
| 3 | (2870.0, 3660.0] | 7864 | 0.173194 | 0.028676 | 1362.0 | 6502.0 | 0.023162 | 0.030181 | 0.569536 | 0.023599 | 0.072111 | 0.022676 |
| 4 | (3660.0, 4450.0] | 6291 | 0.175648 | 0.022940 | 1105.0 | 5186.0 | 0.018792 | 0.024073 | 0.576970 | 0.002453 | 0.007434 | 0.022676 |
| 5 | (4450.0, 5240.0] | 14267 | 0.177402 | 0.052025 | 2531.0 | 11736.0 | 0.043043 | 0.054477 | 0.582280 | 0.001755 | 0.005310 | 0.022676 |
| 6 | (5240.0, 6030.0] | 13689 | 0.165827 | 0.049917 | 2270.0 | 11419.0 | 0.038604 | 0.053005 | 0.547144 | 0.011576 | 0.035136 | 0.022676 |
| 7 | (6030.0, 6820.0] | 4791 | 0.173868 | 0.017470 | 833.0 | 3958.0 | 0.014166 | 0.018372 | 0.571577 | 0.008041 | 0.024434 | 0.022676 |
| 8 | (6820.0, 7610.0] | 10612 | 0.168300 | 0.038697 | 1786.0 | 8826.0 | 0.030373 | 0.040969 | 0.554673 | 0.005568 | 0.016905 | 0.022676 |
| 9 | (7610.0, 8400.0] | 13440 | 0.180283 | 0.049009 | 2423.0 | 11017.0 | 0.041206 | 0.051139 | 0.590984 | 0.011983 | 0.036311 | 0.022676 |
| 10 | (8400.0, 9190.0] | 6964 | 0.170017 | 0.025394 | 1184.0 | 5780.0 | 0.020135 | 0.026830 | 0.559893 | 0.010266 | 0.031091 | 0.022676 |
| 11 | (9190.0, 9980.0] | 5397 | 0.201223 | 0.019680 | 1086.0 | 4311.0 | 0.018469 | 0.020011 | 0.653851 | 0.031206 | 0.093958 | 0.022676 |
| 12 | (9980.0, 10770.0] | 23726 | 0.210613 | 0.086517 | 4997.0 | 18729.0 | 0.084980 | 0.086937 | 0.681829 | 0.009390 | 0.027978 | 0.022676 |
| 13 | (10770.0, 11560.0] | 6664 | 0.230642 | 0.024300 | 1537.0 | 5127.0 | 0.026139 | 0.023799 | 0.741137 | 0.020029 | 0.059308 | 0.022676 |
| 14 | (11560.0, 12350.0] | 16752 | 0.211736 | 0.061087 | 3547.0 | 13205.0 | 0.060321 | 0.061295 | 0.685167 | 0.018906 | 0.055969 | 0.022676 |
| 15 | (12350.0, 13140.0] | 5349 | 0.217798 | 0.019505 | 1165.0 | 4184.0 | 0.019812 | 0.019421 | 0.703158 | 0.006062 | 0.017991 | 0.022676 |
| 16 | (13140.0, 13930.0] | 2814 | 0.264748 | 0.010261 | 745.0 | 2069.0 | 0.012670 | 0.009604 | 0.841227 | 0.046950 | 0.138068 | 0.022676 |
| 17 | (13930.0, 14720.0] | 7764 | 0.231968 | 0.028312 | 1801.0 | 5963.0 | 0.030628 | 0.027679 | 0.745047 | 0.032780 | 0.096180 | 0.022676 |
| 18 | (14720.0, 15510.0] | 16292 | 0.221581 | 0.059409 | 3610.0 | 12682.0 | 0.061392 | 0.058868 | 0.714364 | 0.010387 | 0.030682 | 0.022676 |
| 19 | (15510.0, 16300.0] | 9284 | 0.243968 | 0.033854 | 2265.0 | 7019.0 | 0.038519 | 0.032581 | 0.780359 | 0.022387 | 0.065994 | 0.022676 |
| 20 | (16300.0, 17090.0] | 4042 | 0.239485 | 0.014739 | 968.0 | 3074.0 | 0.016462 | 0.014269 | 0.767183 | 0.004483 | 0.013175 | 0.022676 |
| 21 | (17090.0, 17880.0] | 2133 | 0.280825 | 0.007778 | 599.0 | 1534.0 | 0.010187 | 0.007121 | 0.888140 | 0.041340 | 0.120957 | 0.022676 |
| 22 | (17880.0, 18670.0] | 7859 | 0.236035 | 0.028658 | 1855.0 | 6004.0 | 0.031547 | 0.027870 | 0.757030 | 0.044790 | 0.131110 | 0.022676 |
| 23 | (18670.0, 19460.0] | 3120 | 0.267949 | 0.011377 | 836.0 | 2284.0 | 0.014217 | 0.010602 | 0.850578 | 0.031914 | 0.093548 | 0.022676 |
| 24 | (19460.0, 20250.0] | 16121 | 0.238323 | 0.058786 | 3842.0 | 12279.0 | 0.065338 | 0.056997 | 0.763763 | 0.029626 | 0.086815 | 0.022676 |
| 25 | (20250.0, 21040.0] | 4870 | 0.244148 | 0.017759 | 1189.0 | 3681.0 | 0.020220 | 0.017087 | 0.780887 | 0.005825 | 0.017124 | 0.022676 |
| 26 | (21040.0, 21830.0] | 1549 | 0.300194 | 0.005648 | 465.0 | 1084.0 | 0.007908 | 0.005032 | 0.944528 | 0.056046 | 0.163641 | 0.022676 |
| 27 | (21830.0, 22620.0] | 2759 | 0.246829 | 0.010061 | 681.0 | 2078.0 | 0.011581 | 0.009646 | 0.788757 | 0.053365 | 0.155771 | 0.022676 |
| 28 | (22620.0, 23410.0] | 1818 | 0.246975 | 0.006629 | 449.0 | 1369.0 | 0.007636 | 0.006355 | 0.789186 | 0.000146 | 0.000429 | 0.022676 |
| 29 | (23410.0, 24200.0] | 7816 | 0.236566 | 0.028501 | 1849.0 | 5967.0 | 0.031445 | 0.027698 | 0.758593 | 0.010409 | 0.030593 | 0.022676 |
| 30 | (24200.0, 24990.0] | 1128 | 0.275709 | 0.004113 | 311.0 | 817.0 | 0.005289 | 0.003792 | 0.873225 | 0.039143 | 0.114632 | 0.022676 |
| 31 | (24990.0, 25780.0] | 7941 | 0.227049 | 0.028957 | 1803.0 | 6138.0 | 0.030662 | 0.028492 | 0.730532 | 0.048660 | 0.142693 | 0.022676 |
| 32 | (25780.0, 26570.0] | 1337 | 0.257292 | 0.004875 | 344.0 | 993.0 | 0.005850 | 0.004609 | 0.819424 | 0.030243 | 0.088892 | 0.022676 |
| 33 | (26570.0, 27360.0] | 1162 | 0.274527 | 0.004237 | 319.0 | 843.0 | 0.005425 | 0.003913 | 0.869776 | 0.017234 | 0.050352 | 0.022676 |
| 34 | (27360.0, 28150.0] | 4685 | 0.210672 | 0.017084 | 987.0 | 3698.0 | 0.016785 | 0.017166 | 0.682006 | 0.063854 | 0.187770 | 0.022676 |
| 35 | (28150.0, 28940.0] | 689 | 0.256894 | 0.002512 | 177.0 | 512.0 | 0.003010 | 0.002377 | 0.818258 | 0.046222 | 0.136252 | 0.022676 |
| 36 | (28940.0, 29730.0] | 821 | 0.258222 | 0.002994 | 212.0 | 609.0 | 0.003605 | 0.002827 | 0.822143 | 0.001328 | 0.003886 | 0.022676 |
| 37 | (29730.0, 30520.0] | 6437 | 0.275128 | 0.023473 | 1771.0 | 4666.0 | 0.030118 | 0.021659 | 0.871531 | 0.016906 | 0.049387 | 0.022676 |
| 38 | (30520.0, 31310.0] | 644 | 0.296584 | 0.002348 | 191.0 | 453.0 | 0.003248 | 0.002103 | 0.934026 | 0.021456 | 0.062495 | 0.022676 |
| 39 | (31310.0, 32100.0] | 1569 | 0.284895 | 0.005721 | 447.0 | 1122.0 | 0.007602 | 0.005208 | 0.899997 | 0.011689 | 0.034028 | 0.022676 |
| 40 | (32100.0, 32890.0] | 476 | 0.319328 | 0.001736 | 152.0 | 324.0 | 0.002585 | 0.001504 | 1.000178 | 0.034433 | 0.100181 | 0.022676 |
| 41 | (32890.0, 33680.0] | 771 | 0.258106 | 0.002811 | 199.0 | 572.0 | 0.003384 | 0.002655 | 0.821806 | 0.061221 | 0.178372 | 0.022676 |
| 42 | (33680.0, 34470.0] | 382 | 0.324607 | 0.001393 | 124.0 | 258.0 | 0.002109 | 0.001198 | 1.015535 | 0.066501 | 0.193729 | 0.022676 |
| 43 | (34470.0, 35260.0] | 10860 | 0.262523 | 0.039601 | 2851.0 | 8009.0 | 0.048485 | 0.037176 | 0.834724 | 0.062084 | 0.180811 | 0.022676 |
| 44 | (35260.0, 36050.0] | 329 | 0.273556 | 0.001200 | 90.0 | 239.0 | 0.001531 | 0.001109 | 0.866945 | 0.011033 | 0.032221 | 0.022676 |
| 45 | (36050.0, 36840.0] | 35 | 0.257143 | 0.000128 | 9.0 | 26.0 | 0.000153 | 0.000121 | 0.818986 | 0.016413 | 0.047959 | 0.022676 |
| 46 | (36840.0, 37630.0] | 53 | 0.207547 | 0.000193 | 11.0 | 42.0 | 0.000187 | 0.000195 | 0.672708 | 0.049596 | 0.146278 | 0.022676 |
| 47 | (37630.0, 38420.0] | 72 | 0.250000 | 0.000263 | 18.0 | 54.0 | 0.000306 | 0.000251 | 0.798060 | 0.042453 | 0.125352 | 0.022676 |
| 48 | (38420.0, 39210.0] | 37 | 0.216216 | 0.000135 | 8.0 | 29.0 | 0.000136 | 0.000135 | 0.698469 | 0.033784 | 0.099591 | 0.022676 |
| 49 | (39210.0, 40000.0] | 1538 | 0.283485 | 0.005608 | 436.0 | 1102.0 | 0.007415 | 0.005115 | 0.895890 | 0.067269 | 0.197422 | 0.022676 |
plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
# We create the following categories:
# < 2500, 2500 - 6500, 6500 - 9500, 9500 - 10800, 10800 - 17500, 17500 - 28500, >= 28500.
df_inputs_prepr['loan_amnt:<2500'] = np.where((df_inputs_prepr['loan_amnt'] <= 2500.), 1, 0)
df_inputs_prepr['loan_amnt:2500-6500'] = np.where((df_inputs_prepr['loan_amnt'] > 2500.) & (df_inputs_prepr['loan_amnt'] <= 6500.), 1, 0)
df_inputs_prepr['loan_amnt:6500-9500'] = np.where((df_inputs_prepr['loan_amnt'] > 6500.) & (df_inputs_prepr['loan_amnt'] <= 9500.), 1, 0)
df_inputs_prepr['loan_amnt:9500-11000'] = np.where((df_inputs_prepr['loan_amnt'] > 9500.) & (df_inputs_prepr['loan_amnt'] <= 10800.), 1, 0)
df_inputs_prepr['loan_amnt:11000-17500'] = np.where((df_inputs_prepr['loan_amnt'] > 10800.) & (df_inputs_prepr['loan_amnt'] <= 17500.), 1, 0)
df_inputs_prepr['loan_amnt:17500-28500'] = np.where((df_inputs_prepr['loan_amnt'] > 17500.) & (df_inputs_prepr['loan_amnt'] <= 28500.), 1, 0)
df_inputs_prepr['loan_amnt:>=28500'] = np.where((df_inputs_prepr['loan_amnt'] > 28500.), 1, 0)
# Drop 'loan_amnt_factor' feature
df_inputs_prepr = df_inputs_prepr.drop(columns = ['loan_amnt_factor'])
Variable: 'int_rate'¶
# unique values of 'int_rate'
df_inputs_prepr['int_rate'].unique()
array([13.99, 18.06, 12.29, 7.89, 21.6 , 16.29, 14.33, 14.64, 14.49,
24.99, 18.49, 10.15, 5.32, 13.49, 19.03, 8.18, 9.67, 12.49,
15.61, 21.97, 11.53, 16.99, 9.99, 6.92, 16.02, 6.03, 10.91,
12.12, 13.35, 6.97, 7.9 , 7.49, 7.62, 8.19, 11.49, 11.47,
10.64, 8.9 , 19.99, 14.99, 10.99, 8.49, 13.53, 19.24, 9.16,
8.67, 11.99, 17.86, 30.75, 9.44, 12.69, 13.11, 12.74, 16.01,
14.46, 11.55, 14.16, 20.2 , 15.1 , 18.99, 15.59, 21.49, 13.59,
11.67, 9.17, 26.49, 8.99, 7.97, 9.91, 13.67, 7.39, 19.52,
8.24, 28.67, 19.97, 9.93, 8.39, 14.08, 6.49, 12.99, 13.18,
6.62, 16.46, 17.57, 14.31, 15.49, 7.91, 25.69, 20.99, 9.76,
18.25, 7.35, 13.65, 23.28, 6.72, 14.03, 17.09, 12.62, 12.79,
17.27, 7.69, 9.75, 17.99, 10.38, 7.26, 22.47, 10.75, 7.84,
16.55, 9.92, 10.49, 13.66, 14.65, 18.55, 13.98, 7.99, 9.8 ,
15.31, 24.74, 30.79, 24.7 , 11.22, 21.48, 14.47, 11.86, 15.41,
11.97, 15.88, 13.33, 16.24, 17.1 , 12.18, 12.35, 6.24, 8.38,
20.31, 14.09, 23.99, 20.89, 18.24, 7.34, 7.02, 19.05, 21.99,
19.19, 10.42, 11.14, 6.91, 15.77, 17.56, 6.08, 7.46, 16.2 ,
6.68, 30.99, 21. , 29.69, 9.58, 22.35, 11.39, 7.51, 22.15,
19.48, 17.58, 20.39, 13.06, 15.05, 15.8 , 24.84, 6.89, 20.5 ,
6.39, 11.71, 6. , 12.05, 18.75, 24.49, 14.98, 11.98, 20. ,
13.44, 22.95, 30.84, 18.84, 10. , 25.89, 19.53, 22.74, 24.5 ,
12.42, 9.49, 23.7 , 11.06, 24.85, 12.73, 10.56, 15.95, 22.39,
26.77, 14.26, 26.57, 22.99, 8.59, 10.16, 18.54, 11.44, 10.78,
25.81, 17.76, 15.99, 5.93, 19.22, 6.71, 5.99, 19.42, 7.07,
30.17, 12.59, 21.45, 19.72, 17.77, 21.67, 22.9 , 5.31, 17.47,
7.12, 18.2 , 16.14, 22.7 , 15.22, 13.85, 20.25, 10.41, 10.08,
27.27, 12.88, 21.36, 21.85, 7.66, 16.7 , 7.14, 28.69, 25.29,
20.75, 20.49, 29.49, 15.02, 14.3 , 7.37, 6.99, 12.39, 16.59,
19.89, 6.07, 11.48, 22.4 , 14.84, 16.49, 13.68, 30.94, 15.27,
7.24, 12.98, 8.94, 14.07, 19.47, 12.85, 6.54, 14.48, 23.43,
9.71, 7.21, 10.74, 13.56, 13.72, 19.16, 19.2 , 12.61, 17.14,
25.99, 7.96, 8.6 , 7.74, 13.05, 26.24, 11.11, 9.43, 17.49,
26.06, 21.18, 14.85, 20.8 , 21.7 , 28.99, 26.3 , 17.97, 19.29,
23.1 , 22.45, 13.57, 18.85, 28.72, 27.31, 25.82, 7.88, 27.34,
17.93, 21.98, 18.45, 11.26, 23.76, 23.26, 18.92, 15.04, 13.8 ,
14.52, 14.17, 30.89, 16.45, 28.49, 16.77, 23.88, 23.63, 15.23,
10.47, 24.37, 25.78, 25.09, 28.88, 22.2 , 12.13, 12.84, 9.63,
25.57, 25.49, 23.5 , 8.08, 14.27, 7.59, 22.78, 24.08, 13.61,
21.28, 18.94, 21.15, 19.69, 25.83, 25.88, 29.99, 23.13, 16.69,
30.65, 27.79, 6.19, 16.91, 8.32, 25.28, 15.96, 13.58, 12.87,
19.92, 10.37, 11.36, 13.16, 18.64, 8.46, 23.4 , 15.21, 7.29,
19.13, 15.81, 26.99, 24.89, 16.33, 30.49, 12.68, 16.78, 7.4 ,
9.45, 8.88, 6.67, 22.91, 10.95, 16. , 10.59, 25.8 , 12.53,
13.23, 19.79, 23.32, 24.11, 9.25, 14.22, 14.91, 6.17, 25.34,
13.24, 11.05, 7.56, 5.79, 27.49, 9.88, 13.92, 24.24, 24.83,
5.42, 10.9 , 25.44, 11.89, 13.79, 15.68, 11.8 , 9.33, 10.65,
16.32, 14.35, 15.62, 20.9 , 14.72, 11.83, 23.87, 18.79, 26.14,
28.18, 12.92, 15.2 , 19.91, 6.83, 20.62, 23.91, 26.31, 17.43,
16.4 , 15.28, 10.33, 17.46, 10.62, 15.7 , 14.11, 10.07, 30.74,
10.36, 16.95, 12.23, 6.11, 10.25, 9.07, 28.14, 11.12, 11.91,
23.59, 9.32, 15.33, 9.2 , 11.34, 20.16, 12.09, 17.8 , 14.42,
25.11, 24.33, 10.28, 27.88, 17.39, 9.62, 13.48, 15.57, 23.83,
27.99, 18.39, 14.79, 13.22, 14.59, 14.83, 14.54, 13.47, 8.81,
17.74, 28.34, 8. , 16.89, 6.46, 9.7 , 14.96, 16.35, 11.28,
14.74, 9.83, 18.3 , 13.43, 23.33, 6.76, 11.66, 10.72, 15.65,
7.68, 12.41, 11.58, 17.88, 20.3 , 16.82, 15.58, 10.39, 17.51,
7.43, 10.71, 19.82, 10.83, 11.63, 13.04, 19.74, 17.19, 22.11,
11.31, 18.67, 11.46, 12.21, 16.07, 18.78, 22.06, 8.63, 11.54,
19.41, 11.03, 19.36, 18.62, 9.64, 21.59, 24.2 , 29.67, 15.37,
29.96, 20.48, 19.04, 23.52, 10.14, 11.59, 22.48, 13.3 , 7.75,
17.06, 8.7 , 25.65, 18.72, 20.11, 10.51, 14.61, 15.13, 18.91,
11.09, 17.04, 20.53, 19.66, 12.72, 13.17, 21.74, 16.63, 10.46,
24.52, 18.07, 7.05, 8.07, 12.54, 12.8 , 18.43, 14.75, 12.22,
18.17, 21.14, 20.85, 14.18, 10.96, 20.03, 16.08, 11.78, 13.75,
15.45, 9.38, 11.41, 12.36, 9.01, 20.77, 21.27, 14.93, 23.22,
13.55, 24.76, 12.67, 17.26, 17.34, 16.11, 12.04, 12.17, 7.42,
12.86, 14.82, 21.82, 13.62, 10.2 , 17.15, 9.51])
# loan_amnt
df_inputs_prepr['int_rate_factor'] = pd.cut(df_inputs_prepr['int_rate'], 50)
# Here we do fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.
df_temp = woe_ordered_continuous(df_inputs_prepr, 'int_rate_factor', df_targets_prepr)
# We calculate weight of evidence.
df_temp
C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1025164655.py:4: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning. df = pd.concat([df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].count(), C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1025164655.py:5: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning. df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].mean()], axis = 1)
| int_rate_factor | n_obs | prop_good | prop_n_obs | n_good | n_bad | prop_n_good | prop_n_bad | WoE | diff_prop_good | diff_WoE | IV | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | (5.284, 5.824] | 6219 | 0.039878 | 0.022678 | 248.0 | 5971.0 | 0.004218 | 0.027716 | 0.141645 | NaN | NaN | 0.209145 |
| 1 | (5.824, 6.337] | 4802 | 0.036443 | 0.017511 | 175.0 | 4627.0 | 0.002976 | 0.021478 | 0.129770 | 0.003435 | 0.011876 | 0.209145 |
| 2 | (6.337, 6.851] | 4886 | 0.055055 | 0.017817 | 269.0 | 4617.0 | 0.004575 | 0.021431 | 0.193473 | 0.018612 | 0.063704 | 0.209145 |
| 3 | (6.851, 7.364] | 9903 | 0.070181 | 0.036111 | 695.0 | 9208.0 | 0.011819 | 0.042742 | 0.244143 | 0.015125 | 0.050670 | 0.209145 |
| 4 | (7.364, 7.878] | 6047 | 0.078551 | 0.022051 | 475.0 | 5572.0 | 0.008078 | 0.025864 | 0.271797 | 0.008371 | 0.027654 | 0.209145 |
| 5 | (7.878, 8.392] | 18415 | 0.100081 | 0.067151 | 1843.0 | 16572.0 | 0.031342 | 0.076925 | 0.341776 | 0.021530 | 0.069979 | 0.209145 |
| 6 | (8.392, 8.905] | 4307 | 0.088925 | 0.015706 | 383.0 | 3924.0 | 0.006513 | 0.018215 | 0.305713 | 0.011156 | 0.036063 | 0.209145 |
| 7 | (8.905, 9.419] | 7881 | 0.118513 | 0.028738 | 934.0 | 6947.0 | 0.015884 | 0.032247 | 0.400499 | 0.029588 | 0.094787 | 0.209145 |
| 8 | (9.419, 9.932] | 9372 | 0.143833 | 0.034175 | 1348.0 | 8024.0 | 0.022924 | 0.037246 | 0.479635 | 0.025320 | 0.079136 | 0.209145 |
| 9 | (9.932, 10.446] | 8276 | 0.137869 | 0.030179 | 1141.0 | 7135.0 | 0.019404 | 0.033119 | 0.461140 | 0.005964 | 0.018494 | 0.209145 |
| 10 | (10.446, 10.96] | 8089 | 0.173445 | 0.029497 | 1403.0 | 6686.0 | 0.023860 | 0.031035 | 0.570297 | 0.035577 | 0.109157 | 0.209145 |
| 11 | (10.96, 11.473] | 14962 | 0.154458 | 0.054559 | 2311.0 | 12651.0 | 0.039301 | 0.058724 | 0.512379 | 0.018987 | 0.057919 | 0.209145 |
| 12 | (11.473, 11.987] | 10049 | 0.171161 | 0.036644 | 1720.0 | 8329.0 | 0.029251 | 0.038662 | 0.563368 | 0.016703 | 0.050989 | 0.209145 |
| 13 | (11.987, 12.5] | 15824 | 0.173913 | 0.057703 | 2752.0 | 13072.0 | 0.046801 | 0.060678 | 0.571715 | 0.002752 | 0.008347 | 0.209145 |
| 14 | (12.5, 13.014] | 15417 | 0.214893 | 0.056218 | 3313.0 | 12104.0 | 0.056342 | 0.056185 | 0.694542 | 0.040980 | 0.122827 | 0.209145 |
| 15 | (13.014, 13.528] | 10572 | 0.209894 | 0.038551 | 2219.0 | 8353.0 | 0.037737 | 0.038773 | 0.679692 | 0.004999 | 0.014850 | 0.209145 |
| 16 | (13.528, 14.041] | 14814 | 0.239503 | 0.054020 | 3548.0 | 11266.0 | 0.060338 | 0.052295 | 0.767236 | 0.029609 | 0.087544 | 0.209145 |
| 17 | (14.041, 14.555] | 10999 | 0.240204 | 0.040108 | 2642.0 | 8357.0 | 0.044930 | 0.038792 | 0.769296 | 0.000700 | 0.002060 | 0.209145 |
| 18 | (14.555, 15.068] | 9532 | 0.267520 | 0.034759 | 2550.0 | 6982.0 | 0.043366 | 0.032409 | 0.849325 | 0.027316 | 0.080030 | 0.209145 |
| 19 | (15.068, 15.582] | 4554 | 0.228371 | 0.016606 | 1040.0 | 3514.0 | 0.017686 | 0.016311 | 0.734433 | 0.039149 | 0.114892 | 0.209145 |
| 20 | (15.582, 16.096] | 12273 | 0.280535 | 0.044754 | 3443.0 | 8830.0 | 0.058552 | 0.040987 | 0.887293 | 0.052164 | 0.152860 | 0.209145 |
| 21 | (16.096, 16.609] | 7214 | 0.285694 | 0.026306 | 2061.0 | 5153.0 | 0.035050 | 0.023919 | 0.902326 | 0.005160 | 0.015033 | 0.209145 |
| 22 | (16.609, 17.123] | 6386 | 0.313811 | 0.023287 | 2004.0 | 4382.0 | 0.034080 | 0.020341 | 0.984135 | 0.028117 | 0.081808 | 0.209145 |
| 23 | (17.123, 17.636] | 6197 | 0.306923 | 0.022597 | 1902.0 | 4295.0 | 0.032346 | 0.019937 | 0.964101 | 0.006889 | 0.020034 | 0.209145 |
| 24 | (17.636, 18.15] | 5692 | 0.345397 | 0.020756 | 1966.0 | 3726.0 | 0.033434 | 0.017295 | 1.076067 | 0.038474 | 0.111966 | 0.209145 |
| 25 | (18.15, 18.664] | 6027 | 0.338477 | 0.021978 | 2040.0 | 3987.0 | 0.034693 | 0.018507 | 1.055904 | 0.006920 | 0.020163 | 0.209145 |
| 26 | (18.664, 19.177] | 5708 | 0.345480 | 0.020814 | 1972.0 | 3736.0 | 0.033536 | 0.017342 | 1.076309 | 0.007003 | 0.020405 | 0.209145 |
| 27 | (19.177, 19.691] | 3499 | 0.368963 | 0.012759 | 1291.0 | 2208.0 | 0.021955 | 0.010249 | 1.144900 | 0.023483 | 0.068592 | 0.209145 |
| 28 | (19.691, 20.204] | 4606 | 0.382327 | 0.016796 | 1761.0 | 2845.0 | 0.029948 | 0.013206 | 1.184102 | 0.013365 | 0.039202 | 0.209145 |
| 29 | (20.204, 20.718] | 1330 | 0.328571 | 0.004850 | 437.0 | 893.0 | 0.007432 | 0.004145 | 1.027069 | 0.053756 | 0.157033 | 0.209145 |
| 30 | (20.718, 21.232] | 2649 | 0.392601 | 0.009660 | 1040.0 | 1609.0 | 0.017686 | 0.007469 | 1.214341 | 0.064030 | 0.187273 | 0.209145 |
| 31 | (21.232, 21.745] | 2271 | 0.393219 | 0.008281 | 893.0 | 1378.0 | 0.015187 | 0.006396 | 1.216163 | 0.000618 | 0.001822 | 0.209145 |
| 32 | (21.745, 22.259] | 1727 | 0.413434 | 0.006298 | 714.0 | 1013.0 | 0.012142 | 0.004702 | 1.276005 | 0.020215 | 0.059842 | 0.209145 |
| 33 | (22.259, 22.772] | 1798 | 0.397108 | 0.006556 | 714.0 | 1084.0 | 0.012142 | 0.005032 | 1.227640 | 0.016326 | 0.048365 | 0.209145 |
| 34 | (22.772, 23.286] | 1345 | 0.439405 | 0.004905 | 591.0 | 754.0 | 0.010051 | 0.003500 | 1.353685 | 0.042297 | 0.126045 | 0.209145 |
| 35 | (23.286, 23.8] | 894 | 0.359060 | 0.003260 | 321.0 | 573.0 | 0.005459 | 0.002660 | 1.115938 | 0.080345 | 0.237747 | 0.209145 |
| 36 | (23.8, 24.313] | 1836 | 0.456972 | 0.006695 | 839.0 | 997.0 | 0.014268 | 0.004628 | 1.406852 | 0.097911 | 0.290914 | 0.209145 |
| 37 | (24.313, 24.827] | 1043 | 0.410355 | 0.003803 | 428.0 | 615.0 | 0.007279 | 0.002855 | 1.266859 | 0.046617 | 0.139993 | 0.209145 |
| 38 | (24.827, 25.34] | 1562 | 0.484635 | 0.005696 | 757.0 | 805.0 | 0.012874 | 0.003737 | 1.491831 | 0.074280 | 0.224972 | 0.209145 |
| 39 | (25.34, 25.854] | 1367 | 0.455011 | 0.004985 | 622.0 | 745.0 | 0.010578 | 0.003458 | 1.400889 | 0.029624 | 0.090942 | 0.209145 |
| 40 | (25.854, 26.368] | 1081 | 0.474561 | 0.003942 | 513.0 | 568.0 | 0.008724 | 0.002637 | 1.460689 | 0.019550 | 0.059799 | 0.209145 |
| 41 | (26.368, 26.881] | 362 | 0.569061 | 0.001320 | 206.0 | 156.0 | 0.003503 | 0.000724 | 1.764378 | 0.094500 | 0.303690 | 0.209145 |
| 42 | (26.881, 27.395] | 300 | 0.493333 | 0.001094 | 148.0 | 152.0 | 0.002517 | 0.000706 | 1.518916 | 0.075727 | 0.245462 | 0.209145 |
| 43 | (27.395, 27.908] | 186 | 0.612903 | 0.000678 | 114.0 | 72.0 | 0.001939 | 0.000334 | 1.917045 | 0.119570 | 0.398129 | 0.209145 |
| 44 | (27.908, 28.422] | 114 | 0.535088 | 0.000416 | 61.0 | 53.0 | 0.001037 | 0.000246 | 1.651864 | 0.077816 | 0.265181 | 0.209145 |
| 45 | (28.422, 28.936] | 452 | 0.522124 | 0.001648 | 236.0 | 216.0 | 0.004013 | 0.001003 | 1.610021 | 0.012964 | 0.041843 | 0.209145 |
| 46 | (28.936, 29.449] | 52 | 0.596154 | 0.000190 | 31.0 | 21.0 | 0.000527 | 0.000097 | 1.857594 | 0.074030 | 0.247573 | 0.209145 |
| 47 | (29.449, 29.963] | 248 | 0.552419 | 0.000904 | 137.0 | 111.0 | 0.002330 | 0.000515 | 1.708712 | 0.043734 | 0.148881 | 0.209145 |
| 48 | (29.963, 30.476] | 228 | 0.429825 | 0.000831 | 98.0 | 130.0 | 0.001667 | 0.000603 | 1.324912 | 0.122595 | 0.383800 | 0.209145 |
| 49 | (30.476, 30.99] | 867 | 0.522491 | 0.003162 | 453.0 | 414.0 | 0.007704 | 0.001922 | 1.611199 | 0.092667 | 0.286287 | 0.209145 |
plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
# We create the following categories:
# < 8, 8 - 12.5, 12.5 - 16.5, 16.5 - 20, 20 - 23.5, >= 23.5
df_inputs_prepr['int_rate:<=8'] = np.where((df_inputs_prepr['int_rate'] <= 8.0), 1, 0)
df_inputs_prepr['int_rate:8-12.5'] = np.where((df_inputs_prepr['int_rate'] > 8.0) & (df_inputs_prepr['int_rate'] <= 12.5), 1, 0)
df_inputs_prepr['int_rate:12.5-16.5'] = np.where((df_inputs_prepr['int_rate'] > 12.5) & (df_inputs_prepr['int_rate'] <= 16.5), 1, 0)
df_inputs_prepr['int_rate:16.5-20'] = np.where((df_inputs_prepr['int_rate'] > 16.5) & (df_inputs_prepr['int_rate'] <= 20.0), 1, 0)
df_inputs_prepr['int_rate:20-23.5'] = np.where((df_inputs_prepr['int_rate'] > 20.0) & (df_inputs_prepr['int_rate'] <= 23.5), 1, 0)
df_inputs_prepr['int_rate:>23.5'] = np.where((df_inputs_prepr['int_rate'] > 23.5), 1, 0)
# Drop 'loan_amnt_factor' feature
df_inputs_prepr = df_inputs_prepr.drop(columns = ['int_rate_factor'])
Variable: 'annual_inc'¶
# annual_inc
df_inputs_prepr['annual_inc_factor'] = pd.cut(df_inputs_prepr['annual_inc'], 75)
# Here we do fine-classing: using the 'cut' method, we split the variable into 100 categories by its values.
df_temp = woe_ordered_continuous(df_inputs_prepr, 'annual_inc_factor', df_targets_prepr)
# We calculate weight of evidence.
df_temp
C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1025164655.py:4: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning. df = pd.concat([df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].count(), C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1025164655.py:5: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning. df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].mean()], axis = 1)
| annual_inc_factor | n_obs | prop_good | prop_n_obs | n_good | n_bad | prop_n_good | prop_n_bad | WoE | diff_prop_good | diff_WoE | IV | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | (-9500.0, 126666.667] | 247646 | 0.219491 | 0.903046 | 54356.0 | 193290.0 | 0.924390 | 0.897220 | 0.708175 | NaN | NaN | inf |
| 1 | (126666.667, 253333.333] | 23822 | 0.169801 | 0.086867 | 4045.0 | 19777.0 | 0.068790 | 0.091802 | 0.559236 | 0.049690 | 0.148939 | inf |
| 2 | (253333.333, 380000.0] | 1839 | 0.150625 | 0.006706 | 277.0 | 1562.0 | 0.004711 | 0.007251 | 0.500597 | 0.019176 | 0.058639 | inf |
| 3 | (380000.0, 506666.667] | 537 | 0.111732 | 0.001958 | 60.0 | 477.0 | 0.001020 | 0.002214 | 0.379012 | 0.038893 | 0.121585 | inf |
| 4 | (506666.667, 633333.333] | 174 | 0.149425 | 0.000634 | 26.0 | 148.0 | 0.000442 | 0.000687 | 0.496901 | 0.037693 | 0.117889 | inf |
| 5 | (633333.333, 760000.0] | 78 | 0.166667 | 0.000284 | 13.0 | 65.0 | 0.000221 | 0.000302 | 0.549702 | 0.017241 | 0.052801 | inf |
| 6 | (760000.0, 886666.667] | 38 | 0.210526 | 0.000139 | 8.0 | 30.0 | 0.000136 | 0.000139 | 0.681572 | 0.043860 | 0.131870 | inf |
| 7 | (886666.667, 1013333.333] | 39 | 0.179487 | 0.000142 | 7.0 | 32.0 | 0.000119 | 0.000149 | 0.588581 | 0.031039 | 0.092990 | inf |
| 8 | (1013333.333, 1140000.0] | 15 | 0.133333 | 0.000055 | 2.0 | 13.0 | 0.000034 | 0.000060 | 0.447019 | 0.046154 | 0.141563 | inf |
| 9 | (1140000.0, 1266666.667] | 9 | 0.111111 | 0.000033 | 1.0 | 8.0 | 0.000017 | 0.000037 | 0.377039 | 0.022222 | 0.069980 | inf |
| 10 | (1266666.667, 1393333.333] | 2 | 0.000000 | 0.000007 | 0.0 | 2.0 | 0.000000 | 0.000009 | 0.000000 | 0.111111 | 0.377039 | inf |
| 11 | (1393333.333, 1520000.0] | 7 | 0.285714 | 0.000026 | 2.0 | 5.0 | 0.000034 | 0.000023 | 0.902384 | 0.285714 | 0.902384 | inf |
| 12 | (1520000.0, 1646666.667] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 13 | (1646666.667, 1773333.333] | 1 | 0.000000 | 0.000004 | 0.0 | 1.0 | 0.000000 | 0.000005 | 0.000000 | NaN | NaN | inf |
| 14 | (1773333.333, 1900000.0] | 3 | 0.000000 | 0.000011 | 0.0 | 3.0 | 0.000000 | 0.000014 | 0.000000 | 0.000000 | 0.000000 | inf |
| 15 | (1900000.0, 2026666.667] | 2 | 0.000000 | 0.000007 | 0.0 | 2.0 | 0.000000 | 0.000009 | 0.000000 | 0.000000 | 0.000000 | inf |
| 16 | (2026666.667, 2153333.333] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 17 | (2153333.333, 2280000.0] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 18 | (2280000.0, 2406666.667] | 2 | 0.000000 | 0.000007 | 0.0 | 2.0 | 0.000000 | 0.000009 | 0.000000 | NaN | NaN | inf |
| 19 | (2406666.667, 2533333.333] | 3 | 0.000000 | 0.000011 | 0.0 | 3.0 | 0.000000 | 0.000014 | 0.000000 | 0.000000 | 0.000000 | inf |
| 20 | (2533333.333, 2660000.0] | 1 | 1.000000 | 0.000004 | 1.0 | 0.0 | 0.000017 | 0.000000 | inf | 1.000000 | inf | inf |
| 21 | (2660000.0, 2786666.667] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 22 | (2786666.667, 2913333.333] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 23 | (2913333.333, 3040000.0] | 3 | 0.666667 | 0.000011 | 2.0 | 1.0 | 0.000034 | 0.000005 | 2.119548 | NaN | NaN | inf |
| 24 | (3040000.0, 3166666.667] | 2 | 0.500000 | 0.000007 | 1.0 | 1.0 | 0.000017 | 0.000005 | 1.539806 | 0.166667 | 0.579742 | inf |
| 25 | (3166666.667, 3293333.333] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 26 | (3293333.333, 3420000.0] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 27 | (3420000.0, 3546666.667] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 28 | (3546666.667, 3673333.333] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 29 | (3673333.333, 3800000.0] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 30 | (3800000.0, 3926666.667] | 1 | 0.000000 | 0.000004 | 0.0 | 1.0 | 0.000000 | 0.000005 | 0.000000 | NaN | NaN | inf |
| 31 | (3926666.667, 4053333.333] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 32 | (4053333.333, 4180000.0] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 33 | (4180000.0, 4306666.667] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 34 | (4306666.667, 4433333.333] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 35 | (4433333.333, 4560000.0] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 36 | (4560000.0, 4686666.667] | 1 | 0.000000 | 0.000004 | 0.0 | 1.0 | 0.000000 | 0.000005 | 0.000000 | NaN | NaN | inf |
| 37 | (4686666.667, 4813333.333] | 1 | 0.000000 | 0.000004 | 0.0 | 1.0 | 0.000000 | 0.000005 | 0.000000 | 0.000000 | 0.000000 | inf |
| 38 | (4813333.333, 4940000.0] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 39 | (4940000.0, 5066666.667] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 40 | (5066666.667, 5193333.333] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 41 | (5193333.333, 5320000.0] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 42 | (5320000.0, 5446666.667] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 43 | (5446666.667, 5573333.333] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 44 | (5573333.333, 5700000.0] | 1 | 0.000000 | 0.000004 | 0.0 | 1.0 | 0.000000 | 0.000005 | 0.000000 | NaN | NaN | inf |
| 45 | (5700000.0, 5826666.667] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 46 | (5826666.667, 5953333.333] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 47 | (5953333.333, 6080000.0] | 1 | 0.000000 | 0.000004 | 0.0 | 1.0 | 0.000000 | 0.000005 | 0.000000 | NaN | NaN | inf |
| 48 | (6080000.0, 6206666.667] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 49 | (6206666.667, 6333333.333] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 50 | (6333333.333, 6460000.0] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 51 | (6460000.0, 6586666.667] | 1 | 0.000000 | 0.000004 | 0.0 | 1.0 | 0.000000 | 0.000005 | 0.000000 | NaN | NaN | inf |
| 52 | (6586666.667, 6713333.333] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 53 | (6713333.333, 6840000.0] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 54 | (6840000.0, 6966666.667] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 55 | (6966666.667, 7093333.333] | 2 | 0.000000 | 0.000007 | 0.0 | 2.0 | 0.000000 | 0.000009 | 0.000000 | NaN | NaN | inf |
| 56 | (7093333.333, 7220000.0] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 57 | (7220000.0, 7346666.667] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 58 | (7346666.667, 7473333.333] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 59 | (7473333.333, 7600000.0] | 1 | 0.000000 | 0.000004 | 0.0 | 1.0 | 0.000000 | 0.000005 | 0.000000 | NaN | NaN | inf |
| 60 | (7600000.0, 7726666.667] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 61 | (7726666.667, 7853333.333] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 62 | (7853333.333, 7980000.0] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 63 | (7980000.0, 8106666.667] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 64 | (8106666.667, 8233333.333] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 65 | (8233333.333, 8360000.0] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 66 | (8360000.0, 8486666.667] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 67 | (8486666.667, 8613333.333] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 68 | (8613333.333, 8740000.0] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 69 | (8740000.0, 8866666.667] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 70 | (8866666.667, 8993333.333] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 71 | (8993333.333, 9120000.0] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 72 | (9120000.0, 9246666.667] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 73 | (9246666.667, 9373333.333] | 1 | 0.000000 | 0.000004 | 0.0 | 1.0 | 0.000000 | 0.000005 | 0.000000 | NaN | NaN | inf |
| 74 | (9373333.333, 9500000.0] | 1 | 1.000000 | 0.000004 | 1.0 | 0.0 | 0.000017 | 0.000000 | inf | 1.000000 | inf | inf |
# Initial examination shows that there are too few individuals with large income and too many with small income.
# Hence, we are going to have one category for more than 140K, and we are going to apply our approach to determine
# the categories of everyone with 140k or less.
df_inputs_prepr_temp = df_inputs_prepr.loc[df_inputs_prepr['annual_inc'] <= 140000., : ]
#loan_data_temp = loan_data_temp.reset_index(drop = True)
#df_inputs_prepr_temp
df_inputs_prepr_temp["annual_inc_factor"] = pd.cut(df_inputs_prepr_temp['annual_inc'], 50)
# Here we do fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.
df_temp = woe_ordered_continuous(df_inputs_prepr_temp, 'annual_inc_factor', df_targets_prepr[df_inputs_prepr_temp.index])
# We calculate weight of evidence.
df_temp
C:\Users\pc\AppData\Local\Temp\ipykernel_6728\3753315463.py:7: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy df_inputs_prepr_temp["annual_inc_factor"] = pd.cut(df_inputs_prepr_temp['annual_inc'], 50) C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1025164655.py:4: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning. df = pd.concat([df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].count(), C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1025164655.py:5: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning. df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].mean()], axis = 1)
| annual_inc_factor | n_obs | prop_good | prop_n_obs | n_good | n_bad | prop_n_good | prop_n_bad | WoE | diff_prop_good | diff_WoE | IV | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | (-140.0, 2800.0] | 83 | 0.240964 | 0.000325 | 20.0 | 63.0 | 0.000360 | 0.000316 | 0.760526 | NaN | NaN | 0.010553 |
| 1 | (2800.0, 5600.0] | 31 | 0.354839 | 0.000122 | 11.0 | 20.0 | 0.000198 | 0.000100 | 1.089912 | 0.113875 | 0.329386 | 0.010553 |
| 2 | (5600.0, 8400.0] | 78 | 0.269231 | 0.000306 | 21.0 | 57.0 | 0.000378 | 0.000286 | 0.842560 | 0.085608 | 0.247352 | 0.010553 |
| 3 | (8400.0, 11200.0] | 350 | 0.291429 | 0.001372 | 102.0 | 248.0 | 0.001835 | 0.001243 | 0.906712 | 0.022198 | 0.064152 | 0.010553 |
| 4 | (11200.0, 14000.0] | 647 | 0.281298 | 0.002537 | 182.0 | 465.0 | 0.003275 | 0.002331 | 0.877455 | 0.010130 | 0.029257 | 0.010553 |
| 5 | (14000.0, 16800.0] | 1014 | 0.279093 | 0.003976 | 283.0 | 731.0 | 0.005092 | 0.003665 | 0.871081 | 0.002206 | 0.006374 | 0.010553 |
| 6 | (16800.0, 19600.0] | 1285 | 0.257588 | 0.005038 | 331.0 | 954.0 | 0.005956 | 0.004783 | 0.808830 | 0.021505 | 0.062251 | 0.010553 |
| 7 | (19600.0, 22400.0] | 2672 | 0.264222 | 0.010477 | 706.0 | 1966.0 | 0.012704 | 0.009856 | 0.828057 | 0.006634 | 0.019227 | 0.010553 |
| 8 | (22400.0, 25200.0] | 4492 | 0.255120 | 0.017613 | 1146.0 | 3346.0 | 0.020621 | 0.016775 | 0.801672 | 0.009101 | 0.026385 | 0.010553 |
| 9 | (25200.0, 28000.0] | 3879 | 0.260376 | 0.015209 | 1010.0 | 2869.0 | 0.018174 | 0.014383 | 0.816916 | 0.005256 | 0.015243 | 0.010553 |
| 10 | (28000.0, 30800.0] | 6069 | 0.245840 | 0.023796 | 1492.0 | 4577.0 | 0.026847 | 0.022946 | 0.774714 | 0.014537 | 0.042202 | 0.010553 |
| 11 | (30800.0, 33600.0] | 5768 | 0.248440 | 0.022616 | 1433.0 | 4335.0 | 0.025785 | 0.021733 | 0.782273 | 0.002600 | 0.007559 | 0.010553 |
| 12 | (33600.0, 36400.0] | 9387 | 0.251838 | 0.036806 | 2364.0 | 7023.0 | 0.042537 | 0.035209 | 0.792144 | 0.003398 | 0.009871 | 0.010553 |
| 13 | (36400.0, 39200.0] | 6374 | 0.238783 | 0.024992 | 1522.0 | 4852.0 | 0.027386 | 0.024325 | 0.754172 | 0.013055 | 0.037972 | 0.010553 |
| 14 | (39200.0, 42000.0] | 13875 | 0.234739 | 0.054403 | 3257.0 | 10618.0 | 0.058605 | 0.053232 | 0.742383 | 0.004044 | 0.011789 | 0.010553 |
| 15 | (42000.0, 44800.0] | 5049 | 0.246583 | 0.019797 | 1245.0 | 3804.0 | 0.022402 | 0.019071 | 0.776877 | 0.011845 | 0.034494 | 0.010553 |
| 16 | (44800.0, 47600.0] | 11615 | 0.232114 | 0.045542 | 2696.0 | 8919.0 | 0.048511 | 0.044715 | 0.734722 | 0.014470 | 0.042155 | 0.010553 |
| 17 | (47600.0, 50400.0] | 15763 | 0.232443 | 0.061806 | 3664.0 | 12099.0 | 0.065929 | 0.060657 | 0.735684 | 0.000329 | 0.000962 | 0.010553 |
| 18 | (50400.0, 53200.0] | 8349 | 0.219907 | 0.032736 | 1836.0 | 6513.0 | 0.033036 | 0.032652 | 0.699011 | 0.012536 | 0.036673 | 0.010553 |
| 19 | (53200.0, 56000.0] | 11837 | 0.225395 | 0.046412 | 2668.0 | 9169.0 | 0.048007 | 0.045968 | 0.715086 | 0.005488 | 0.016074 | 0.010553 |
| 20 | (56000.0, 58800.0] | 5150 | 0.220194 | 0.020193 | 1134.0 | 4016.0 | 0.020405 | 0.020134 | 0.699855 | 0.005201 | 0.015231 | 0.010553 |
| 21 | (58800.0, 61600.0] | 13673 | 0.224823 | 0.053611 | 3074.0 | 10599.0 | 0.055313 | 0.053137 | 0.713411 | 0.004628 | 0.013556 | 0.010553 |
| 22 | (61600.0, 64400.0] | 6577 | 0.216360 | 0.025788 | 1423.0 | 5154.0 | 0.025605 | 0.025839 | 0.688607 | 0.008463 | 0.024804 | 0.010553 |
| 23 | (64400.0, 67200.0] | 11622 | 0.223714 | 0.045569 | 2600.0 | 9022.0 | 0.046784 | 0.045231 | 0.710165 | 0.007354 | 0.021558 | 0.010553 |
| 24 | (67200.0, 70000.0] | 11659 | 0.217343 | 0.045714 | 2534.0 | 9125.0 | 0.045596 | 0.045747 | 0.691492 | 0.006371 | 0.018673 | 0.010553 |
| 25 | (70000.0, 72800.0] | 4992 | 0.199319 | 0.019573 | 995.0 | 3997.0 | 0.017904 | 0.020039 | 0.638407 | 0.018024 | 0.053085 | 0.010553 |
| 26 | (72800.0, 75600.0] | 10040 | 0.213546 | 0.039366 | 2144.0 | 7896.0 | 0.038578 | 0.039586 | 0.680341 | 0.014227 | 0.041934 | 0.010553 |
| 27 | (75600.0, 78400.0] | 4336 | 0.200876 | 0.017001 | 871.0 | 3465.0 | 0.015673 | 0.017371 | 0.643010 | 0.012669 | 0.037331 | 0.010553 |
| 28 | (78400.0, 81200.0] | 9345 | 0.209524 | 0.036641 | 1958.0 | 7387.0 | 0.035232 | 0.037034 | 0.668512 | 0.008647 | 0.025502 | 0.010553 |
| 29 | (81200.0, 84000.0] | 4429 | 0.191465 | 0.017366 | 848.0 | 3581.0 | 0.015259 | 0.017953 | 0.615143 | 0.018058 | 0.053369 | 0.010553 |
| 30 | (84000.0, 86800.0] | 6724 | 0.206425 | 0.026364 | 1388.0 | 5336.0 | 0.024975 | 0.026752 | 0.659384 | 0.014959 | 0.044240 | 0.010553 |
| 31 | (86800.0, 89600.0] | 3212 | 0.190224 | 0.012594 | 611.0 | 2601.0 | 0.010994 | 0.013040 | 0.611458 | 0.016201 | 0.047925 | 0.010553 |
| 32 | (89600.0, 92400.0] | 7977 | 0.193180 | 0.031277 | 1541.0 | 6436.0 | 0.027728 | 0.032266 | 0.620231 | 0.002956 | 0.008773 | 0.010553 |
| 33 | (92400.0, 95200.0] | 4984 | 0.186798 | 0.019542 | 931.0 | 4053.0 | 0.016752 | 0.020319 | 0.601274 | 0.006383 | 0.018957 | 0.010553 |
| 34 | (95200.0, 98000.0] | 3475 | 0.195108 | 0.013625 | 678.0 | 2797.0 | 0.012200 | 0.014023 | 0.625944 | 0.008310 | 0.024670 | 0.010553 |
| 35 | (98000.0, 100800.0] | 6710 | 0.188972 | 0.026310 | 1268.0 | 5442.0 | 0.022816 | 0.027283 | 0.607738 | 0.006136 | 0.018206 | 0.010553 |
| 36 | (100800.0, 103600.0] | 2338 | 0.165526 | 0.009167 | 387.0 | 1951.0 | 0.006964 | 0.009781 | 0.537625 | 0.023446 | 0.070113 | 0.010553 |
| 37 | (103600.0, 106400.0] | 3469 | 0.170366 | 0.013602 | 591.0 | 2878.0 | 0.010634 | 0.014429 | 0.552176 | 0.004840 | 0.014551 | 0.010553 |
| 38 | (106400.0, 109200.0] | 1740 | 0.172989 | 0.006822 | 301.0 | 1439.0 | 0.005416 | 0.007214 | 0.560042 | 0.002622 | 0.007866 | 0.010553 |
| 39 | (109200.0, 112000.0] | 4536 | 0.189153 | 0.017785 | 858.0 | 3678.0 | 0.015439 | 0.018439 | 0.608278 | 0.016165 | 0.048236 | 0.010553 |
| 40 | (112000.0, 114800.0] | 878 | 0.149203 | 0.003443 | 131.0 | 747.0 | 0.002357 | 0.003745 | 0.488222 | 0.039951 | 0.120056 | 0.010553 |
| 41 | (114800.0, 117600.0] | 2454 | 0.174002 | 0.009622 | 427.0 | 2027.0 | 0.007683 | 0.010162 | 0.563078 | 0.024799 | 0.074856 | 0.010553 |
| 42 | (117600.0, 120400.0] | 4985 | 0.201204 | 0.019546 | 1003.0 | 3982.0 | 0.018048 | 0.019963 | 0.643977 | 0.027202 | 0.080899 | 0.010553 |
| 43 | (120400.0, 123200.0] | 805 | 0.159006 | 0.003156 | 128.0 | 677.0 | 0.002303 | 0.003394 | 0.517955 | 0.042197 | 0.126022 | 0.010553 |
| 44 | (123200.0, 126000.0] | 2880 | 0.187847 | 0.011292 | 541.0 | 2339.0 | 0.009735 | 0.011726 | 0.604396 | 0.028841 | 0.086440 | 0.010553 |
| 45 | (126000.0, 128800.0] | 584 | 0.136986 | 0.002290 | 80.0 | 504.0 | 0.001439 | 0.002527 | 0.450885 | 0.050861 | 0.153511 | 0.010553 |
| 46 | (128800.0, 131600.0] | 2504 | 0.171725 | 0.009818 | 430.0 | 2074.0 | 0.007737 | 0.010398 | 0.556254 | 0.034739 | 0.105369 | 0.010553 |
| 47 | (131600.0, 134400.0] | 641 | 0.159126 | 0.002513 | 102.0 | 539.0 | 0.001835 | 0.002702 | 0.518318 | 0.012599 | 0.037936 | 0.010553 |
| 48 | (134400.0, 137200.0] | 1583 | 0.162982 | 0.006207 | 258.0 | 1325.0 | 0.004642 | 0.006643 | 0.529958 | 0.003855 | 0.011640 | 0.010553 |
| 49 | (137200.0, 140000.0] | 2121 | 0.165488 | 0.008316 | 351.0 | 1770.0 | 0.006316 | 0.008874 | 0.537510 | 0.002506 | 0.007552 | 0.010553 |
plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
# WoE is monotonically decreasing with income, so we split income in 10 equal categories
df_inputs_prepr['log_annual_inc:<20K'] = np.where((df_inputs_prepr['annual_inc'] <= 20000), 1, 0)
df_inputs_prepr['annual_inc:20K-30K'] = np.where((df_inputs_prepr['annual_inc'] > 20000) & (df_inputs_prepr['annual_inc'] <= 30000), 1, 0)
df_inputs_prepr['annual_inc:30K-40K'] = np.where((df_inputs_prepr['annual_inc'] > 30000) & (df_inputs_prepr['annual_inc'] <= 40000), 1, 0)
df_inputs_prepr['annual_inc:40K-50K'] = np.where((df_inputs_prepr['annual_inc'] > 40000) & (df_inputs_prepr['annual_inc'] <= 50000), 1, 0)
df_inputs_prepr['annual_inc:50K-60K'] = np.where((df_inputs_prepr['annual_inc'] > 50000) & (df_inputs_prepr['annual_inc'] <= 60000), 1, 0)
df_inputs_prepr['annual_inc:60K-70K'] = np.where((df_inputs_prepr['annual_inc'] > 60000) & (df_inputs_prepr['annual_inc'] <= 70000), 1, 0)
df_inputs_prepr['annual_inc:70K-80K'] = np.where((df_inputs_prepr['annual_inc'] > 70000) & (df_inputs_prepr['annual_inc'] <= 80000), 1, 0)
df_inputs_prepr['annual_inc:80K-90K'] = np.where((df_inputs_prepr['annual_inc'] > 80000) & (df_inputs_prepr['annual_inc'] <= 90000), 1, 0)
df_inputs_prepr['annual_inc:90K-100K'] = np.where((df_inputs_prepr['annual_inc'] > 90000) & (df_inputs_prepr['annual_inc'] <= 100000), 1, 0)
df_inputs_prepr['annual_inc:100K-120K'] = np.where((df_inputs_prepr['annual_inc'] > 100000) & (df_inputs_prepr['annual_inc'] <= 120000), 1, 0)
df_inputs_prepr['annual_inc:120K-140K'] = np.where((df_inputs_prepr['annual_inc'] > 120000) & (df_inputs_prepr['annual_inc'] <= 140000), 1, 0)
df_inputs_prepr['annual_inc:>140K'] = np.where((df_inputs_prepr['annual_inc'] > 140000), 1, 0)
df_inputs_prepr = df_inputs_prepr.drop(columns = ['annual_inc_factor'])
# Drop the provisory feature
Variable: 'emp_length_int'¶
df_inputs_prepr['emp_length_int'].unique()
array([ 7., 2., 0., 10., 8., 1., 3., 5., 9., 4., 6.])
# mths_since_issue_d
df_temp = woe_ordered_continuous(df_inputs_prepr, 'emp_length_int', df_targets_prepr)
# We calculate weight of evidence.
df_temp
| emp_length_int | n_obs | prop_good | prop_n_obs | n_good | n_bad | prop_n_good | prop_n_bad | WoE | diff_prop_good | diff_WoE | IV | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 22011 | 0.221707 | 0.080264 | 4880.0 | 17131.0 | 0.082990 | 0.079519 | 0.714738 | NaN | NaN | 0.000484 |
| 1 | 1.0 | 18068 | 0.223987 | 0.065885 | 4047.0 | 14021.0 | 0.068824 | 0.065083 | 0.721482 | 0.002280 | 0.006744 | 0.000484 |
| 2 | 2.0 | 24789 | 0.210698 | 0.090394 | 5223.0 | 19566.0 | 0.088824 | 0.090822 | 0.682083 | 0.013289 | 0.039399 | 0.000484 |
| 3 | 3.0 | 22148 | 0.217446 | 0.080763 | 4816.0 | 17332.0 | 0.081902 | 0.080452 | 0.702116 | 0.006748 | 0.020033 | 0.000484 |
| 4 | 4.0 | 16597 | 0.216244 | 0.060521 | 3589.0 | 13008.0 | 0.061035 | 0.060381 | 0.698551 | 0.001202 | 0.003565 | 0.000484 |
| 5 | 5.0 | 17151 | 0.209550 | 0.062541 | 3594.0 | 13557.0 | 0.061120 | 0.062929 | 0.678670 | 0.006693 | 0.019881 | 0.000484 |
| 6 | 6.0 | 12651 | 0.204332 | 0.046132 | 2585.0 | 10066.0 | 0.043961 | 0.046725 | 0.663128 | 0.005219 | 0.015542 | 0.000484 |
| 7 | 7.0 | 12214 | 0.202964 | 0.044539 | 2479.0 | 9735.0 | 0.042158 | 0.045188 | 0.659048 | 0.001368 | 0.004080 | 0.000484 |
| 8 | 8.0 | 12253 | 0.208276 | 0.044681 | 2552.0 | 9701.0 | 0.043400 | 0.045030 | 0.674876 | 0.005312 | 0.015828 | 0.000484 |
| 9 | 9.0 | 10294 | 0.220128 | 0.037537 | 2266.0 | 8028.0 | 0.038536 | 0.037265 | 0.710063 | 0.011853 | 0.035187 | 0.000484 |
| 10 | 10.0 | 106058 | 0.214703 | 0.386743 | 22771.0 | 83287.0 | 0.387249 | 0.386605 | 0.693980 | 0.005425 | 0.016083 | 0.000484 |
plot_by_woe(df_temp)
# We plot the weight of evidence values, rotating the labels 90 degrees.
# We create the following categories:
# 0 , 1, 2 - 4, 5 - 7, 8 - 9, 10
df_inputs_prepr['emp_length_int:0'] = np.where(df_inputs_prepr['emp_length_int'].isin([0]), 1, 0)
df_inputs_prepr['emp_length_int:1'] = np.where(df_inputs_prepr['emp_length_int'].isin([1]), 1, 0)
df_inputs_prepr['emp_length_int:2-4'] = np.where(df_inputs_prepr['emp_length_int'].isin(range(2, 5)), 1, 0)
df_inputs_prepr['emp_length_int:5-7'] = np.where(df_inputs_prepr['emp_length_int'].isin(range(5, 8)), 1, 0)
df_inputs_prepr['emp_length_int:8-9'] = np.where(df_inputs_prepr['emp_length_int'].isin(range(8, 10)), 1, 0)
df_inputs_prepr['emp_length_int:10'] = np.where(df_inputs_prepr['emp_length_int'].isin([10]), 1, 0)
Variable: 'dti'¶
# int_rate
df_inputs_prepr['dti_factor'] = pd.cut(df_inputs_prepr['dti'], 50)
# Here we do fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.
df_temp = woe_ordered_continuous(df_inputs_prepr, 'dti_factor', df_targets_prepr)
# We calculate weight of evidence.
df_temp
C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1025164655.py:4: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning. df = pd.concat([df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].count(), C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1025164655.py:5: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning. df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].mean()], axis = 1)
| dti_factor | n_obs | prop_good | prop_n_obs | n_good | n_bad | prop_n_good | prop_n_bad | WoE | diff_prop_good | diff_WoE | IV | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | (-0.999, 19.98] | 164167 | 0.183502 | 0.598638 | 30125.0 | 134042.0 | 0.512313 | 0.622201 | 0.600696 | NaN | NaN | inf |
| 1 | (19.98, 39.96] | 108557 | 0.259136 | 0.395855 | 28131.0 | 80426.0 | 0.478402 | 0.373324 | 0.824818 | 0.075634 | 0.224122 | inf |
| 2 | (39.96, 59.94] | 1108 | 0.349278 | 0.004040 | 387.0 | 721.0 | 0.006581 | 0.003347 | 1.087383 | 0.090142 | 0.262565 | inf |
| 3 | (59.94, 79.92] | 204 | 0.392157 | 0.000744 | 80.0 | 124.0 | 0.001360 | 0.000576 | 1.213032 | 0.042879 | 0.125649 | inf |
| 4 | (79.92, 99.9] | 76 | 0.421053 | 0.000277 | 32.0 | 44.0 | 0.000544 | 0.000204 | 1.298691 | 0.028896 | 0.085659 | inf |
| 5 | (99.9, 119.88] | 32 | 0.406250 | 0.000117 | 13.0 | 19.0 | 0.000221 | 0.000088 | 1.254684 | 0.014803 | 0.044007 | inf |
| 6 | (119.88, 139.86] | 26 | 0.538462 | 0.000095 | 14.0 | 12.0 | 0.000238 | 0.000056 | 1.662846 | 0.132212 | 0.408161 | inf |
| 7 | (139.86, 159.84] | 5 | 0.400000 | 0.000018 | 2.0 | 3.0 | 0.000034 | 0.000014 | 1.236185 | 0.138462 | 0.426660 | inf |
| 8 | (159.84, 179.82] | 6 | 0.166667 | 0.000022 | 1.0 | 5.0 | 0.000017 | 0.000023 | 0.549702 | 0.233333 | 0.686483 | inf |
| 9 | (179.82, 199.8] | 4 | 0.000000 | 0.000015 | 0.0 | 4.0 | 0.000000 | 0.000019 | 0.000000 | 0.166667 | 0.549702 | inf |
| 10 | (199.8, 219.78] | 5 | 0.200000 | 0.000018 | 1.0 | 4.0 | 0.000017 | 0.000019 | 0.650199 | 0.200000 | 0.650199 | inf |
| 11 | (219.78, 239.76] | 5 | 0.000000 | 0.000018 | 0.0 | 5.0 | 0.000000 | 0.000023 | 0.000000 | 0.200000 | 0.650199 | inf |
| 12 | (239.76, 259.74] | 6 | 0.666667 | 0.000022 | 4.0 | 2.0 | 0.000068 | 0.000009 | 2.119548 | 0.666667 | 2.119548 | inf |
| 13 | (259.74, 279.72] | 3 | 0.333333 | 0.000011 | 1.0 | 2.0 | 0.000017 | 0.000009 | 1.040928 | 0.333333 | 1.078620 | inf |
| 14 | (279.72, 299.7] | 3 | 0.333333 | 0.000011 | 1.0 | 2.0 | 0.000017 | 0.000009 | 1.040928 | 0.000000 | 0.000000 | inf |
| 15 | (299.7, 319.68] | 5 | 0.200000 | 0.000018 | 1.0 | 4.0 | 0.000017 | 0.000019 | 0.650199 | 0.133333 | 0.390729 | inf |
| 16 | (319.68, 339.66] | 2 | 0.500000 | 0.000007 | 1.0 | 1.0 | 0.000017 | 0.000005 | 1.539806 | 0.300000 | 0.889607 | inf |
| 17 | (339.66, 359.64] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 18 | (359.64, 379.62] | 1 | 0.000000 | 0.000004 | 0.0 | 1.0 | 0.000000 | 0.000005 | 0.000000 | NaN | NaN | inf |
| 19 | (379.62, 399.6] | 1 | 1.000000 | 0.000004 | 1.0 | 0.0 | 0.000017 | 0.000000 | inf | 1.000000 | inf | inf |
| 20 | (399.6, 419.58] | 1 | 0.000000 | 0.000004 | 0.0 | 1.0 | 0.000000 | 0.000005 | 0.000000 | 1.000000 | inf | inf |
| 21 | (419.58, 439.56] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 22 | (439.56, 459.54] | 1 | 0.000000 | 0.000004 | 0.0 | 1.0 | 0.000000 | 0.000005 | 0.000000 | NaN | NaN | inf |
| 23 | (459.54, 479.52] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 24 | (479.52, 499.5] | 1 | 1.000000 | 0.000004 | 1.0 | 0.0 | 0.000017 | 0.000000 | inf | NaN | NaN | inf |
| 25 | (499.5, 519.48] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 26 | (519.48, 539.46] | 1 | 0.000000 | 0.000004 | 0.0 | 1.0 | 0.000000 | 0.000005 | 0.000000 | NaN | NaN | inf |
| 27 | (539.46, 559.44] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 28 | (559.44, 579.42] | 1 | 1.000000 | 0.000004 | 1.0 | 0.0 | 0.000017 | 0.000000 | inf | NaN | NaN | inf |
| 29 | (579.42, 599.4] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 30 | (599.4, 619.38] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 31 | (619.38, 639.36] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 32 | (639.36, 659.34] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 33 | (659.34, 679.32] | 2 | 0.500000 | 0.000007 | 1.0 | 1.0 | 0.000017 | 0.000005 | 1.539806 | NaN | NaN | inf |
| 34 | (679.32, 699.3] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 35 | (699.3, 719.28] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 36 | (719.28, 739.26] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 37 | (739.26, 759.24] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 38 | (759.24, 779.22] | 1 | 0.000000 | 0.000004 | 0.0 | 1.0 | 0.000000 | 0.000005 | 0.000000 | NaN | NaN | inf |
| 39 | (779.22, 799.2] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 40 | (799.2, 819.18] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 41 | (819.18, 839.16] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 42 | (839.16, 859.14] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 43 | (859.14, 879.12] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 44 | (879.12, 899.1] | 1 | 1.000000 | 0.000004 | 1.0 | 0.0 | 0.000017 | 0.000000 | inf | NaN | NaN | inf |
| 45 | (899.1, 919.08] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 46 | (919.08, 939.06] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 47 | (939.06, 959.04] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 48 | (959.04, 979.02] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 49 | (979.02, 999.0] | 9 | 0.333333 | 0.000033 | 3.0 | 6.0 | 0.000051 | 0.000028 | 1.040928 | NaN | NaN | inf |
plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
# one category for everyone with 'dti' higher than 40 is considered.
# the categories of everyone with 'dti' less or equal 40.
df_inputs_prepr_temp = df_inputs_prepr.loc[df_inputs_prepr['dti'] <= 40., : ]
#loan_data_temp = loan_data_temp.reset_index(drop = True)
#df_inputs_prepr_temp
df_inputs_prepr_temp["dti_factor"] = pd.cut(df_inputs_prepr_temp['dti'], 2)
# Here we do fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.
df_temp = woe_ordered_continuous(df_inputs_prepr_temp, 'dti_factor', df_targets_prepr[df_inputs_prepr_temp.index])
# We calculate weight of evidence.
df_temp
C:\Users\pc\AppData\Local\Temp\ipykernel_6728\3006790741.py:6: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy df_inputs_prepr_temp["dti_factor"] = pd.cut(df_inputs_prepr_temp['dti'], 2) C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1025164655.py:4: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning. df = pd.concat([df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].count(), C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1025164655.py:5: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning. df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].mean()], axis = 1)
| dti_factor | n_obs | prop_good | prop_n_obs | n_good | n_bad | prop_n_good | prop_n_bad | WoE | diff_prop_good | diff_WoE | IV | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | (-0.04, 20.0] | 164391 | 0.183471 | 0.602712 | 30161.0 | 134230.0 | 0.51767 | 0.625813 | 0.602782 | NaN | NaN | 0.024369 |
| 1 | (20.0, 40.0] | 108361 | 0.259337 | 0.397288 | 28102.0 | 80259.0 | 0.48233 | 0.374187 | 0.828119 | 0.075866 | 0.225336 | 0.024369 |
plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
# We create the following categories: '<= 10.', '10. - 20.', '20. - 30.', '30. - 40.', (> 40.)
df_inputs_prepr['dti:<=10'] = np.where((df_inputs_prepr['dti'] <= 10.), 1, 0)
df_inputs_prepr['dti:10-20'] = np.where((df_inputs_prepr['dti'] > 10.) & (df_inputs_prepr['dti'] <= 20.), 1, 0)
df_inputs_prepr['dti:20-30'] = np.where((df_inputs_prepr['dti'] > 20.) & (df_inputs_prepr['dti'] <= 30.), 1, 0)
df_inputs_prepr['dti:30-40'] = np.where((df_inputs_prepr['dti'] > 30.) & (df_inputs_prepr['dti'] <= 40.), 1, 0)
df_inputs_prepr['dti:>40'] = np.where((df_inputs_prepr['dti'] > 40.), 1, 0)
df_inputs_prepr = df_inputs_prepr.drop(columns = ['dti_factor'])
Variable: 'min_mths_since_delinquency'¶
# one category will be created for 'min_mths_since_delinquency' = 999 corresponds to missing values.
#***********************************************************************************************
# the categories of everyone with 'min_mths_since_delinquency' less 300 (max value = 226).
df_inputs_prepr_temp = df_inputs_prepr.loc[df_inputs_prepr['min_mths_since_delinquency'] <= 500, : ]
#loan_data_temp = loan_data_temp.reset_index(drop = True)
#df_inputs_prepr_temp
df_inputs_prepr_temp['min_mths_since_delinquency_factor'] = pd.cut(df_inputs_prepr_temp['min_mths_since_delinquency'], 50)
# Here we do fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.
df_temp = woe_ordered_continuous(df_inputs_prepr_temp, 'min_mths_since_delinquency_factor', df_targets_prepr[df_inputs_prepr_temp.index])
# We calculate weight of evidence.
df_temp
C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1149940471.py:7: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy df_inputs_prepr_temp['min_mths_since_delinquency_factor'] = pd.cut(df_inputs_prepr_temp['min_mths_since_delinquency'], 50) C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1025164655.py:4: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning. df = pd.concat([df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].count(), C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1025164655.py:5: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning. df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].mean()], axis = 1)
| min_mths_since_delinquency_factor | n_obs | prop_good | prop_n_obs | n_good | n_bad | prop_n_good | prop_n_bad | WoE | diff_prop_good | diff_WoE | IV | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | (-0.156, 3.12] | 4315 | 0.221089 | 0.031104 | 954.0 | 3361.0 | 0.030663 | 0.031232 | 0.684007 | NaN | NaN | 0.001227 |
| 1 | (3.12, 6.24] | 6932 | 0.242787 | 0.049969 | 1683.0 | 5249.0 | 0.054095 | 0.048776 | 0.746239 | 0.021698 | 0.062232 | 0.001227 |
| 2 | (6.24, 9.36] | 7868 | 0.235511 | 0.056716 | 1853.0 | 6015.0 | 0.059559 | 0.055894 | 0.725409 | 0.007276 | 0.020830 | 0.001227 |
| 3 | (9.36, 12.48] | 7648 | 0.231041 | 0.055130 | 1767.0 | 5881.0 | 0.056795 | 0.054649 | 0.712594 | 0.004470 | 0.012815 | 0.001227 |
| 4 | (12.48, 15.6] | 7595 | 0.228440 | 0.054748 | 1735.0 | 5860.0 | 0.055766 | 0.054453 | 0.705130 | 0.002601 | 0.007464 | 0.001227 |
| 5 | (15.6, 18.72] | 7320 | 0.223497 | 0.052766 | 1636.0 | 5684.0 | 0.052584 | 0.052818 | 0.690932 | 0.004942 | 0.014198 | 0.001227 |
| 6 | (18.72, 21.84] | 7057 | 0.223466 | 0.050870 | 1577.0 | 5480.0 | 0.050688 | 0.050922 | 0.690843 | 0.000031 | 0.000090 | 0.001227 |
| 7 | (21.84, 24.96] | 6578 | 0.223776 | 0.047417 | 1472.0 | 5106.0 | 0.047313 | 0.047447 | 0.691734 | 0.000310 | 0.000892 | 0.001227 |
| 8 | (24.96, 28.08] | 9075 | 0.225785 | 0.065416 | 2049.0 | 7026.0 | 0.065859 | 0.065288 | 0.697507 | 0.002009 | 0.005773 | 0.001227 |
| 9 | (28.08, 31.2] | 6289 | 0.223565 | 0.045334 | 1406.0 | 4883.0 | 0.045192 | 0.045375 | 0.691127 | 0.002220 | 0.006380 | 0.001227 |
| 10 | (31.2, 34.32] | 6001 | 0.221296 | 0.043258 | 1328.0 | 4673.0 | 0.042684 | 0.043423 | 0.684604 | 0.002269 | 0.006523 | 0.001227 |
| 11 | (34.32, 37.44] | 5808 | 0.217631 | 0.041866 | 1264.0 | 4544.0 | 0.040627 | 0.042225 | 0.674053 | 0.003666 | 0.010551 | 0.001227 |
| 12 | (37.44, 40.56] | 5682 | 0.228793 | 0.040958 | 1300.0 | 4382.0 | 0.041785 | 0.040719 | 0.706143 | 0.011162 | 0.032090 | 0.001227 |
| 13 | (40.56, 43.68] | 5444 | 0.216385 | 0.039243 | 1178.0 | 4266.0 | 0.037863 | 0.039641 | 0.670464 | 0.012408 | 0.035679 | 0.001227 |
| 14 | (43.68, 46.8] | 5261 | 0.224672 | 0.037923 | 1182.0 | 4079.0 | 0.037992 | 0.037904 | 0.694309 | 0.008287 | 0.023845 | 0.001227 |
| 15 | (46.8, 49.92] | 4929 | 0.218097 | 0.035530 | 1075.0 | 3854.0 | 0.034553 | 0.035813 | 0.675395 | 0.006575 | 0.018914 | 0.001227 |
| 16 | (49.92, 53.04] | 4738 | 0.225623 | 0.034153 | 1069.0 | 3669.0 | 0.034360 | 0.034094 | 0.697040 | 0.007526 | 0.021645 | 0.001227 |
| 17 | (53.04, 56.16] | 3677 | 0.220560 | 0.026505 | 811.0 | 2866.0 | 0.026067 | 0.026632 | 0.682486 | 0.005062 | 0.014555 | 0.001227 |
| 18 | (56.16, 59.28] | 3671 | 0.211114 | 0.026462 | 775.0 | 2896.0 | 0.024910 | 0.026911 | 0.655265 | 0.009446 | 0.027221 | 0.001227 |
| 19 | (59.28, 62.4] | 3379 | 0.215448 | 0.024357 | 728.0 | 2651.0 | 0.023399 | 0.024634 | 0.667765 | 0.004334 | 0.012500 | 0.001227 |
| 20 | (62.4, 65.52] | 3343 | 0.227640 | 0.024098 | 761.0 | 2582.0 | 0.024460 | 0.023993 | 0.702834 | 0.012191 | 0.035068 | 0.001227 |
| 21 | (65.52, 68.64] | 3395 | 0.213844 | 0.024473 | 726.0 | 2669.0 | 0.023335 | 0.024801 | 0.663140 | 0.013796 | 0.039694 | 0.001227 |
| 22 | (68.64, 71.76] | 3173 | 0.220611 | 0.022872 | 700.0 | 2473.0 | 0.022499 | 0.022980 | 0.682633 | 0.006768 | 0.019493 | 0.001227 |
| 23 | (71.76, 74.88] | 2952 | 0.226287 | 0.021279 | 668.0 | 2284.0 | 0.021471 | 0.021224 | 0.698949 | 0.005676 | 0.016317 | 0.001227 |
| 24 | (74.88, 78.0] | 3656 | 0.210613 | 0.026354 | 770.0 | 2886.0 | 0.024749 | 0.026818 | 0.653817 | 0.015675 | 0.045132 | 0.001227 |
| 25 | (78.0, 81.12] | 2322 | 0.212748 | 0.016738 | 494.0 | 1828.0 | 0.015878 | 0.016986 | 0.659978 | 0.002135 | 0.006161 | 0.001227 |
| 26 | (81.12, 84.24] | 393 | 0.259542 | 0.002833 | 102.0 | 291.0 | 0.003278 | 0.002704 | 0.794086 | 0.046794 | 0.134107 | 0.001227 |
| 27 | (84.24, 87.36] | 59 | 0.186441 | 0.000425 | 11.0 | 48.0 | 0.000354 | 0.000446 | 0.583710 | 0.073101 | 0.210376 | 0.001227 |
| 28 | (87.36, 90.48] | 29 | 0.241379 | 0.000209 | 7.0 | 22.0 | 0.000225 | 0.000204 | 0.742212 | 0.054939 | 0.158502 | 0.001227 |
| 29 | (90.48, 93.6] | 30 | 0.133333 | 0.000216 | 4.0 | 26.0 | 0.000129 | 0.000242 | 0.426670 | 0.108046 | 0.315542 | 0.001227 |
| 30 | (93.6, 96.72] | 22 | 0.181818 | 0.000159 | 4.0 | 18.0 | 0.000129 | 0.000167 | 0.570220 | 0.048485 | 0.143550 | 0.001227 |
| 31 | (96.72, 99.84] | 21 | 0.238095 | 0.000151 | 5.0 | 16.0 | 0.000161 | 0.000149 | 0.732812 | 0.056277 | 0.162591 | 0.001227 |
| 32 | (99.84, 102.96] | 11 | 0.454545 | 0.000079 | 5.0 | 6.0 | 0.000161 | 0.000056 | 1.356470 | 0.216450 | 0.623658 | 0.001227 |
| 33 | (102.96, 106.08] | 13 | 0.230769 | 0.000094 | 3.0 | 10.0 | 0.000096 | 0.000093 | 0.711815 | 0.223776 | 0.644655 | 0.001227 |
| 34 | (106.08, 109.2] | 10 | 0.300000 | 0.000072 | 3.0 | 7.0 | 0.000096 | 0.000065 | 0.909230 | 0.069231 | 0.197414 | 0.001227 |
| 35 | (109.2, 112.32] | 7 | 0.285714 | 0.000050 | 2.0 | 5.0 | 0.000064 | 0.000046 | 0.868604 | 0.014286 | 0.040625 | 0.001227 |
| 36 | (112.32, 115.44] | 7 | 0.285714 | 0.000050 | 2.0 | 5.0 | 0.000064 | 0.000046 | 0.868604 | 0.000000 | 0.000000 | 0.001227 |
| 37 | (115.44, 118.56] | 2 | 0.000000 | 0.000014 | 0.0 | 2.0 | 0.000000 | 0.000019 | 0.000000 | 0.285714 | 0.868604 | 0.001227 |
| 38 | (118.56, 121.68] | 2 | 0.000000 | 0.000014 | 0.0 | 2.0 | 0.000000 | 0.000019 | 0.000000 | 0.000000 | 0.000000 | 0.001227 |
| 39 | (121.68, 124.8] | 2 | 0.500000 | 0.000014 | 1.0 | 1.0 | 0.000032 | 0.000009 | 1.494914 | 0.500000 | 1.494914 | 0.001227 |
| 40 | (124.8, 127.92] | 3 | 0.666667 | 0.000022 | 2.0 | 1.0 | 0.000064 | 0.000009 | 2.069127 | 0.166667 | 0.574213 | 0.001227 |
| 41 | (127.92, 131.04] | 1 | 0.000000 | 0.000007 | 0.0 | 1.0 | 0.000000 | 0.000009 | 0.000000 | 0.666667 | 2.069127 | 0.001227 |
| 42 | (131.04, 134.16] | 1 | 0.000000 | 0.000007 | 0.0 | 1.0 | 0.000000 | 0.000009 | 0.000000 | 0.000000 | 0.000000 | 0.001227 |
| 43 | (134.16, 137.28] | 2 | 0.000000 | 0.000014 | 0.0 | 2.0 | 0.000000 | 0.000019 | 0.000000 | 0.000000 | 0.000000 | 0.001227 |
| 44 | (137.28, 140.4] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.001227 |
| 45 | (140.4, 143.52] | 2 | 0.000000 | 0.000014 | 0.0 | 2.0 | 0.000000 | 0.000019 | 0.000000 | NaN | NaN | 0.001227 |
| 46 | (143.52, 146.64] | 1 | 0.000000 | 0.000007 | 0.0 | 1.0 | 0.000000 | 0.000009 | 0.000000 | 0.000000 | 0.000000 | 0.001227 |
| 47 | (146.64, 149.76] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.001227 |
| 48 | (149.76, 152.88] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.001227 |
| 49 | (152.88, 156.0] | 1 | 0.000000 | 0.000007 | 0.0 | 1.0 | 0.000000 | 0.000009 | 0.000000 | NaN | NaN | 0.001227 |
plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
## We create the following categories:
# 'Missing', # <=20, # 20 - 40, # 40 - 80, # >80
df_inputs_prepr['min_mths_since_delinquency:Missing'] = np.where(df_inputs_prepr['min_mths_since_delinquency'].isin([999]), 1, 0)
df_inputs_prepr['min_mths_since_delinquency:<=20'] = np.where(df_inputs_prepr['min_mths_since_delinquency'].isin(range(21)), 1, 0)
df_inputs_prepr['min_mths_since_delinquency:20-40'] = np.where(df_inputs_prepr['min_mths_since_delinquency'].isin(range(21, 41)), 1, 0)
df_inputs_prepr['min_mths_since_delinquency:40-80'] = np.where(df_inputs_prepr['min_mths_since_delinquency'].isin(range(41, 81)), 1, 0)
df_inputs_prepr['min_mths_since_delinquency:>80'] = np.where(df_inputs_prepr['min_mths_since_delinquency'].isin(range(81, int(df_inputs_prepr_temp['mths_since_earliest_cr_line'].max()))), 1, 0)
Variable: 'mths_since_earliest_cr_line'¶
# mths_since_earliest_cr_line
df_inputs_prepr['mths_since_earliest_cr_line_factor'] = pd.cut(df_inputs_prepr['mths_since_earliest_cr_line'], 45)
# Here we do fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.
df_temp = woe_ordered_continuous(df_inputs_prepr, 'mths_since_earliest_cr_line_factor', df_targets_prepr)
# We calculate weight of evidence.
df_temp
C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1025164655.py:4: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning. df = pd.concat([df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].count(), C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1025164655.py:5: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning. df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].mean()], axis = 1)
| mths_since_earliest_cr_line_factor | n_obs | prop_good | prop_n_obs | n_good | n_bad | prop_n_good | prop_n_bad | WoE | diff_prop_good | diff_WoE | IV | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | (63.009, 86.022] | 828 | 0.379227 | 0.003019 | 314.0 | 514.0 | 0.005340 | 0.002386 | 1.174995 | NaN | NaN | 0.014181 |
| 1 | (86.022, 108.044] | 3257 | 0.312558 | 0.011877 | 1018.0 | 2239.0 | 0.017312 | 0.010393 | 0.980488 | 0.066669 | 0.194507 | 0.014181 |
| 2 | (108.044, 130.067] | 6589 | 0.274093 | 0.024027 | 1806.0 | 4783.0 | 0.030713 | 0.022202 | 0.868512 | 0.038464 | 0.111977 | 0.014181 |
| 3 | (130.067, 152.089] | 8358 | 0.249103 | 0.030478 | 2082.0 | 6276.0 | 0.035407 | 0.029132 | 0.795429 | 0.024991 | 0.073083 | 0.014181 |
| 4 | (152.089, 174.111] | 16774 | 0.247168 | 0.061167 | 4146.0 | 12628.0 | 0.070508 | 0.058617 | 0.789754 | 0.001934 | 0.005675 | 0.014181 |
| 5 | (174.111, 196.133] | 26766 | 0.247105 | 0.097603 | 6614.0 | 20152.0 | 0.112479 | 0.093542 | 0.789567 | 0.000064 | 0.000187 | 0.014181 |
| 6 | (196.133, 218.156] | 31526 | 0.231618 | 0.114960 | 7302.0 | 24224.0 | 0.124179 | 0.112444 | 0.744016 | 0.015486 | 0.045551 | 0.014181 |
| 7 | (218.156, 240.178] | 33108 | 0.211278 | 0.120729 | 6995.0 | 26113.0 | 0.118959 | 0.121212 | 0.683807 | 0.020340 | 0.060208 | 0.014181 |
| 8 | (240.178, 262.2] | 29997 | 0.203487 | 0.109385 | 6104.0 | 23893.0 | 0.103806 | 0.110907 | 0.660609 | 0.007791 | 0.023199 | 0.014181 |
| 9 | (262.2, 284.222] | 24356 | 0.199951 | 0.088815 | 4870.0 | 19486.0 | 0.082820 | 0.090451 | 0.650051 | 0.003536 | 0.010557 | 0.014181 |
| 10 | (284.222, 306.244] | 19873 | 0.201027 | 0.072467 | 3995.0 | 15878.0 | 0.067940 | 0.073703 | 0.653265 | 0.001076 | 0.003214 | 0.014181 |
| 11 | (306.244, 328.267] | 16852 | 0.188998 | 0.061451 | 3185.0 | 13667.0 | 0.054165 | 0.063440 | 0.617236 | 0.012028 | 0.036029 | 0.014181 |
| 12 | (328.267, 350.289] | 12111 | 0.189910 | 0.044163 | 2300.0 | 9811.0 | 0.039114 | 0.045541 | 0.619974 | 0.000912 | 0.002739 | 0.014181 |
| 13 | (350.289, 372.311] | 9398 | 0.185359 | 0.034270 | 1742.0 | 7656.0 | 0.029625 | 0.035538 | 0.606288 | 0.004551 | 0.013686 | 0.014181 |
| 14 | (372.311, 394.333] | 8182 | 0.180640 | 0.029836 | 1478.0 | 6704.0 | 0.025135 | 0.031119 | 0.592064 | 0.004718 | 0.014224 | 0.014181 |
| 15 | (394.333, 416.356] | 6746 | 0.189149 | 0.024599 | 1276.0 | 5470.0 | 0.021700 | 0.025391 | 0.617689 | 0.008509 | 0.025625 | 0.014181 |
| 16 | (416.356, 438.378] | 5158 | 0.184762 | 0.018809 | 953.0 | 4205.0 | 0.016207 | 0.019519 | 0.604490 | 0.004388 | 0.013198 | 0.014181 |
| 17 | (438.378, 460.4] | 4077 | 0.170223 | 0.014867 | 694.0 | 3383.0 | 0.011802 | 0.015703 | 0.560519 | 0.014538 | 0.043972 | 0.014181 |
| 18 | (460.4, 482.422] | 2610 | 0.167433 | 0.009517 | 437.0 | 2173.0 | 0.007432 | 0.010087 | 0.552035 | 0.002790 | 0.008484 | 0.014181 |
| 19 | (482.422, 504.444] | 1839 | 0.177814 | 0.006706 | 327.0 | 1512.0 | 0.005561 | 0.007018 | 0.583525 | 0.010381 | 0.031490 | 0.014181 |
| 20 | (504.444, 526.467] | 1718 | 0.185099 | 0.006265 | 318.0 | 1400.0 | 0.005408 | 0.006499 | 0.605506 | 0.007285 | 0.021982 | 0.014181 |
| 21 | (526.467, 548.489] | 1273 | 0.186174 | 0.004642 | 237.0 | 1036.0 | 0.004030 | 0.004809 | 0.608744 | 0.001075 | 0.003237 | 0.014181 |
| 22 | (548.489, 570.511] | 813 | 0.206642 | 0.002965 | 168.0 | 645.0 | 0.002857 | 0.002994 | 0.670013 | 0.020468 | 0.061269 | 0.014181 |
| 23 | (570.511, 592.533] | 623 | 0.191011 | 0.002272 | 119.0 | 504.0 | 0.002024 | 0.002339 | 0.623281 | 0.015631 | 0.046732 | 0.014181 |
| 24 | (592.533, 614.556] | 436 | 0.199541 | 0.001590 | 87.0 | 349.0 | 0.001480 | 0.001620 | 0.648828 | 0.008530 | 0.025547 | 0.014181 |
| 25 | (614.556, 636.578] | 338 | 0.230769 | 0.001233 | 78.0 | 260.0 | 0.001326 | 0.001207 | 0.741511 | 0.031228 | 0.092683 | 0.014181 |
| 26 | (636.578, 658.6] | 270 | 0.255556 | 0.000985 | 69.0 | 201.0 | 0.001173 | 0.000933 | 0.814339 | 0.024786 | 0.072828 | 0.014181 |
| 27 | (658.6, 680.622] | 141 | 0.205674 | 0.000514 | 29.0 | 112.0 | 0.000493 | 0.000520 | 0.667128 | 0.049882 | 0.147211 | 0.014181 |
| 28 | (680.622, 702.644] | 101 | 0.277228 | 0.000368 | 28.0 | 73.0 | 0.000476 | 0.000339 | 0.877653 | 0.071554 | 0.210525 | 0.014181 |
| 29 | (702.644, 724.667] | 47 | 0.255319 | 0.000171 | 12.0 | 35.0 | 0.000204 | 0.000162 | 0.813647 | 0.021909 | 0.064007 | 0.014181 |
| 30 | (724.667, 746.689] | 28 | 0.214286 | 0.000102 | 6.0 | 22.0 | 0.000102 | 0.000102 | 0.692740 | 0.041033 | 0.120906 | 0.014181 |
| 31 | (746.689, 768.711] | 18 | 0.388889 | 0.000066 | 7.0 | 11.0 | 0.000119 | 0.000051 | 1.203403 | 0.174603 | 0.510663 | 0.014181 |
| 32 | (768.711, 790.733] | 11 | 0.272727 | 0.000040 | 3.0 | 8.0 | 0.000051 | 0.000037 | 0.864527 | 0.116162 | 0.338877 | 0.014181 |
| 33 | (790.733, 812.756] | 3 | 0.333333 | 0.000011 | 1.0 | 2.0 | 0.000017 | 0.000009 | 1.040928 | 0.060606 | 0.176401 | 0.014181 |
| 34 | (812.756, 834.778] | 4 | 0.500000 | 0.000015 | 2.0 | 2.0 | 0.000034 | 0.000009 | 1.539806 | 0.166667 | 0.498878 | 0.014181 |
| 35 | (834.778, 856.8] | 2 | 0.000000 | 0.000007 | 0.0 | 2.0 | 0.000000 | 0.000009 | 0.000000 | 0.500000 | 1.539806 | 0.014181 |
| 36 | (856.8, 878.822] | 1 | 0.000000 | 0.000004 | 0.0 | 1.0 | 0.000000 | 0.000005 | 0.000000 | 0.000000 | 0.000000 | 0.014181 |
| 37 | (878.822, 900.844] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.014181 |
| 38 | (900.844, 922.867] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.014181 |
| 39 | (922.867, 944.889] | 1 | 0.000000 | 0.000004 | 0.0 | 1.0 | 0.000000 | 0.000005 | 0.000000 | NaN | NaN | 0.014181 |
| 40 | (944.889, 966.911] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.014181 |
| 41 | (966.911, 988.933] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.014181 |
| 42 | (988.933, 1010.956] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.014181 |
| 43 | (1010.956, 1032.978] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.014181 |
| 44 | (1032.978, 1055.0] | 1 | 0.000000 | 0.000004 | 0.0 | 1.0 | 0.000000 | 0.000005 | 0.000000 | NaN | NaN | 0.014181 |
plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
# We create the following categories:
# <= 120, # 121 - 200, # 201 - 260, # 261 - 320, # 320 - 400, # 401 - 600, # => 601
df_inputs_prepr['mths_since_earliest_cr_line:<=120'] = np.where(df_inputs_prepr['mths_since_earliest_cr_line'].isin(range(121)), 1, 0)
df_inputs_prepr['mths_since_earliest_cr_line:121-200'] = np.where(df_inputs_prepr['mths_since_earliest_cr_line'].isin(range(121, 201)), 1, 0)
df_inputs_prepr['mths_since_earliest_cr_line:201-260'] = np.where(df_inputs_prepr['mths_since_earliest_cr_line'].isin(range(201,261)), 1, 0)
df_inputs_prepr['mths_since_earliest_cr_line:261-320'] = np.where(df_inputs_prepr['mths_since_earliest_cr_line'].isin(range(261, 321)), 1, 0)
df_inputs_prepr['mths_since_earliest_cr_line:321-400'] = np.where(df_inputs_prepr['mths_since_earliest_cr_line'].isin(range(321, 401)), 1, 0)
df_inputs_prepr['mths_since_earliest_cr_line:401-600'] = np.where(df_inputs_prepr['mths_since_earliest_cr_line'].isin(range(401, 601)), 1, 0)
df_inputs_prepr['mths_since_earliest_cr_line:>=601'] = np.where(df_inputs_prepr['mths_since_earliest_cr_line'].isin(range(601, int(df_inputs_prepr['mths_since_earliest_cr_line'].max()))), 1, 0)
df_inputs_prepr = df_inputs_prepr.drop(columns = ['mths_since_earliest_cr_line_factor'])
# Drop the provisory feature
Variable: 'delinq_2yrs'¶
df_inputs_prepr['delinq_2yrs'].unique()
array([ 0., 1., 2., 3., 4., 5., 14., 7., 6., 9., 10., 8., 13.,
11., 12., 18., 16., 15., 20., 17., 19., 36., 26., 27., 22., 24.])
# delinq_2yrs
df_temp = woe_ordered_continuous(df_inputs_prepr, 'delinq_2yrs', df_targets_prepr)
# We calculate weight of evidence.
df_temp
| delinq_2yrs | n_obs | prop_good | prop_n_obs | n_good | n_bad | prop_n_good | prop_n_bad | WoE | diff_prop_good | diff_WoE | IV | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 221126 | 0.210726 | 0.806341 | 46597.0 | 174529.0 | 0.792439 | 0.810135 | 0.682165 | NaN | NaN | inf |
| 1 | 1.0 | 35193 | 0.223482 | 0.128332 | 7865.0 | 27328.0 | 0.133754 | 0.126852 | 0.719988 | 0.012756 | 0.037823 | inf |
| 2 | 2.0 | 10331 | 0.235989 | 0.037672 | 2438.0 | 7893.0 | 0.041461 | 0.036638 | 0.756893 | 0.012507 | 0.036905 | inf |
| 3 | 3.0 | 3763 | 0.257242 | 0.013722 | 968.0 | 2795.0 | 0.016462 | 0.012974 | 0.819275 | 0.021253 | 0.062381 | inf |
| 4 | 4.0 | 1651 | 0.245306 | 0.006020 | 405.0 | 1246.0 | 0.006888 | 0.005784 | 0.784287 | 0.011936 | 0.034988 | inf |
| 5 | 5.0 | 870 | 0.237931 | 0.003172 | 207.0 | 663.0 | 0.003520 | 0.003078 | 0.762610 | 0.007375 | 0.021677 | inf |
| 6 | 6.0 | 503 | 0.228628 | 0.001834 | 115.0 | 388.0 | 0.001956 | 0.001801 | 0.735194 | 0.009303 | 0.027417 | inf |
| 7 | 7.0 | 295 | 0.254237 | 0.001076 | 75.0 | 220.0 | 0.001275 | 0.001021 | 0.810478 | 0.025609 | 0.075285 | inf |
| 8 | 8.0 | 158 | 0.259494 | 0.000576 | 41.0 | 117.0 | 0.000697 | 0.000543 | 0.825865 | 0.005256 | 0.015387 | inf |
| 9 | 9.0 | 114 | 0.271930 | 0.000416 | 31.0 | 83.0 | 0.000527 | 0.000385 | 0.862200 | 0.012436 | 0.036335 | inf |
| 10 | 10.0 | 72 | 0.222222 | 0.000263 | 16.0 | 56.0 | 0.000272 | 0.000260 | 0.716262 | 0.049708 | 0.145938 | inf |
| 11 | 11.0 | 44 | 0.340909 | 0.000160 | 15.0 | 29.0 | 0.000255 | 0.000135 | 1.062988 | 0.118687 | 0.346727 | inf |
| 12 | 12.0 | 31 | 0.193548 | 0.000113 | 6.0 | 25.0 | 0.000102 | 0.000116 | 0.630891 | 0.147361 | 0.432097 | inf |
| 13 | 13.0 | 25 | 0.240000 | 0.000091 | 6.0 | 19.0 | 0.000102 | 0.000088 | 0.768697 | 0.046452 | 0.137806 | inf |
| 14 | 14.0 | 19 | 0.315789 | 0.000069 | 6.0 | 13.0 | 0.000102 | 0.000060 | 0.989887 | 0.075789 | 0.221191 | inf |
| 15 | 15.0 | 13 | 0.307692 | 0.000047 | 4.0 | 9.0 | 0.000068 | 0.000042 | 0.966339 | 0.008097 | 0.023548 | inf |
| 16 | 16.0 | 6 | 0.333333 | 0.000022 | 2.0 | 4.0 | 0.000034 | 0.000019 | 1.040928 | 0.025641 | 0.074589 | inf |
| 17 | 17.0 | 5 | 0.200000 | 0.000018 | 1.0 | 4.0 | 0.000017 | 0.000019 | 0.650199 | 0.133333 | 0.390729 | inf |
| 18 | 18.0 | 2 | 0.000000 | 0.000007 | 0.0 | 2.0 | 0.000000 | 0.000009 | 0.000000 | 0.200000 | 0.650199 | inf |
| 19 | 19.0 | 2 | 0.500000 | 0.000007 | 1.0 | 1.0 | 0.000017 | 0.000005 | 1.539806 | 0.500000 | 1.539806 | inf |
| 20 | 20.0 | 5 | 0.200000 | 0.000018 | 1.0 | 4.0 | 0.000017 | 0.000019 | 0.650199 | 0.300000 | 0.889607 | inf |
| 21 | 22.0 | 1 | 0.000000 | 0.000004 | 0.0 | 1.0 | 0.000000 | 0.000005 | 0.000000 | 0.200000 | 0.650199 | inf |
| 22 | 24.0 | 1 | 0.000000 | 0.000004 | 0.0 | 1.0 | 0.000000 | 0.000005 | 0.000000 | 0.000000 | 0.000000 | inf |
| 23 | 26.0 | 2 | 0.000000 | 0.000007 | 0.0 | 2.0 | 0.000000 | 0.000009 | 0.000000 | 0.000000 | 0.000000 | inf |
| 24 | 27.0 | 1 | 1.000000 | 0.000004 | 1.0 | 0.0 | 0.000017 | 0.000000 | inf | 1.000000 | inf | inf |
| 25 | 36.0 | 1 | 1.000000 | 0.000004 | 1.0 | 0.0 | 0.000017 | 0.000000 | inf | 0.000000 | NaN | inf |
plot_by_woe(df_temp)
# We plot the weight of evidence values.
# Categories: 0, 1, 2-9, >=10
df_inputs_prepr['delinq_2yrs:0'] = np.where((df_inputs_prepr['delinq_2yrs'] == 0), 1, 0)
df_inputs_prepr['delinq_2yrs:1'] = np.where((df_inputs_prepr['delinq_2yrs'] == 1), 1, 0)
df_inputs_prepr['delinq_2yrs:2-9'] = np.where((df_inputs_prepr['delinq_2yrs'] >= 2) & (df_inputs_prepr['delinq_2yrs'] <= 9), 1, 0)
df_inputs_prepr['delinq_2yrs:>=10'] = np.where((df_inputs_prepr['delinq_2yrs'] >= 10), 1, 0)
Variable: 'inq_last_6mths'¶
df_inputs_prepr['inq_last_6mths'].unique()
array([1., 2., 0., 3., 6., 4., 5., 7., 8.])
# inq_last_6mths
df_temp = woe_ordered_continuous(df_inputs_prepr, 'inq_last_6mths', df_targets_prepr)
# We calculate weight of evidence.
df_temp
| inq_last_6mths | n_obs | prop_good | prop_n_obs | n_good | n_bad | prop_n_good | prop_n_bad | WoE | diff_prop_good | diff_WoE | IV | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 157016 | 0.195547 | 0.572562 | 30704.0 | 126312.0 | 0.522159 | 0.586320 | 0.636879 | NaN | NaN | 0.010856 |
| 1 | 1.0 | 74667 | 0.227222 | 0.272275 | 16966.0 | 57701.0 | 0.288528 | 0.267839 | 0.731042 | 0.031675 | 0.094163 | 0.010856 |
| 2 | 2.0 | 27967 | 0.253477 | 0.101982 | 7089.0 | 20878.0 | 0.120557 | 0.096912 | 0.808252 | 0.026255 | 0.077210 | 0.010856 |
| 3 | 3.0 | 10517 | 0.275554 | 0.038350 | 2898.0 | 7619.0 | 0.049284 | 0.035366 | 0.872772 | 0.022077 | 0.064520 | 0.010856 |
| 4 | 4.0 | 2849 | 0.282906 | 0.010389 | 806.0 | 2043.0 | 0.013707 | 0.009483 | 0.894204 | 0.007352 | 0.021432 | 0.010856 |
| 5 | 5.0 | 1006 | 0.290258 | 0.003668 | 292.0 | 714.0 | 0.004966 | 0.003314 | 0.915616 | 0.007352 | 0.021412 | 0.010856 |
| 6 | 6.0 | 200 | 0.230000 | 0.000729 | 46.0 | 154.0 | 0.000782 | 0.000715 | 0.739242 | 0.060258 | 0.176374 | 0.010856 |
| 7 | 7.0 | 6 | 0.166667 | 0.000022 | 1.0 | 5.0 | 0.000017 | 0.000023 | 0.549702 | 0.063333 | 0.189540 | 0.010856 |
| 8 | 8.0 | 6 | 0.000000 | 0.000022 | 0.0 | 6.0 | 0.000000 | 0.000028 | 0.000000 | 0.166667 | 0.549702 | 0.010856 |
plot_by_woe(df_temp)
# We plot the weight of evidence values.
# Categories: 0, 1 - 2, 3 - 5, >= 6
df_inputs_prepr['inq_last_6mths:0'] = np.where((df_inputs_prepr['inq_last_6mths'] == 0), 1, 0)
df_inputs_prepr['inq_last_6mths:1-2'] = np.where((df_inputs_prepr['inq_last_6mths'] >= 1) & (df_inputs_prepr['inq_last_6mths'] <= 2), 1, 0)
df_inputs_prepr['inq_last_6mths:3-5'] = np.where((df_inputs_prepr['inq_last_6mths'] >= 3) & (df_inputs_prepr['inq_last_6mths'] <= 5), 1, 0)
df_inputs_prepr['inq_last_6mths:>=6'] = np.where((df_inputs_prepr['inq_last_6mths'] >= 6), 1, 0)
Variable: 'collections_12_mths_ex_med'¶
df_inputs_prepr['collections_12_mths_ex_med'].unique()
array([ 0., 1., 2., 3., 7., 4., 6., 5., 14., 9.])
# open_acc
df_temp = woe_ordered_continuous(df_inputs_prepr, 'collections_12_mths_ex_med', df_targets_prepr)
# We calculate weight of evidence.
df_temp
| collections_12_mths_ex_med | n_obs | prop_good | prop_n_obs | n_good | n_bad | prop_n_good | prop_n_bad | WoE | diff_prop_good | diff_WoE | IV | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 269936 | 0.213421 | 0.984327 | 57610.0 | 212326.0 | 0.979729 | 0.985582 | 0.690173 | NaN | NaN | 0.001155 |
| 1 | 1.0 | 3999 | 0.278570 | 0.014582 | 1114.0 | 2885.0 | 0.018945 | 0.013392 | 0.881566 | 0.065149 | 0.191393 | 0.001155 |
| 2 | 2.0 | 250 | 0.268000 | 0.000912 | 67.0 | 183.0 | 0.001139 | 0.000849 | 0.850727 | 0.010570 | 0.030838 | 0.001155 |
| 3 | 3.0 | 29 | 0.241379 | 0.000106 | 7.0 | 22.0 | 0.000119 | 0.000102 | 0.772752 | 0.026621 | 0.077975 | 0.001155 |
| 4 | 4.0 | 11 | 0.363636 | 0.000040 | 4.0 | 7.0 | 0.000068 | 0.000032 | 1.129314 | 0.122257 | 0.356562 | 0.001155 |
| 5 | 5.0 | 3 | 0.000000 | 0.000011 | 0.0 | 3.0 | 0.000000 | 0.000014 | 0.000000 | 0.363636 | 1.129314 | 0.001155 |
| 6 | 6.0 | 2 | 0.000000 | 0.000007 | 0.0 | 2.0 | 0.000000 | 0.000009 | 0.000000 | 0.000000 | 0.000000 | 0.001155 |
| 7 | 7.0 | 2 | 0.000000 | 0.000007 | 0.0 | 2.0 | 0.000000 | 0.000009 | 0.000000 | 0.000000 | 0.000000 | 0.001155 |
| 8 | 9.0 | 1 | 0.000000 | 0.000004 | 0.0 | 1.0 | 0.000000 | 0.000005 | 0.000000 | 0.000000 | 0.000000 | 0.001155 |
| 9 | 14.0 | 1 | 0.000000 | 0.000004 | 0.0 | 1.0 | 0.000000 | 0.000005 | 0.000000 | 0.000000 | 0.000000 | 0.001155 |
plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
# Categories: '0', '1', '>=2'
df_inputs_prepr['collections_12_mths_ex_med:0'] = np.where((df_inputs_prepr['collections_12_mths_ex_med'] == 0), 1, 0)
df_inputs_prepr['collections_12_mths_ex_med:1'] = np.where((df_inputs_prepr['collections_12_mths_ex_med'] == 1), 1, 0)
df_inputs_prepr['collections_12_mths_ex_med:>=2'] = np.where((df_inputs_prepr['collections_12_mths_ex_med'] >= 2), 1, 0)
Variable: 'chargeoff_within_12_mths'¶
df_inputs_prepr['chargeoff_within_12_mths'].unique()
array([0., 1., 3., 2., 4., 5., 6., 8., 7.])
# pub_rec
df_temp = woe_ordered_continuous(df_inputs_prepr, 'chargeoff_within_12_mths', df_targets_prepr)
# We calculate weight of evidence.
df_temp
| chargeoff_within_12_mths | n_obs | prop_good | prop_n_obs | n_good | n_bad | prop_n_good | prop_n_bad | WoE | diff_prop_good | diff_WoE | IV | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 271937 | 0.214256 | 0.991624 | 58264.0 | 213673.0 | 0.990851 | 0.991835 | 0.692651 | NaN | NaN | inf |
| 1 | 1.0 | 2080 | 0.236058 | 0.007585 | 491.0 | 1589.0 | 0.008350 | 0.007376 | 0.757096 | 0.021802 | 0.064445 | inf |
| 2 | 2.0 | 168 | 0.220238 | 0.000613 | 37.0 | 131.0 | 0.000629 | 0.000608 | 0.710388 | 0.015820 | 0.046708 | inf |
| 3 | 3.0 | 27 | 0.185185 | 0.000098 | 5.0 | 22.0 | 0.000085 | 0.000102 | 0.605766 | 0.035053 | 0.104622 | inf |
| 4 | 4.0 | 13 | 0.307692 | 0.000047 | 4.0 | 9.0 | 0.000068 | 0.000042 | 0.966339 | 0.122507 | 0.360573 | inf |
| 5 | 5.0 | 5 | 0.000000 | 0.000018 | 0.0 | 5.0 | 0.000000 | 0.000023 | 0.000000 | 0.307692 | 0.966339 | inf |
| 6 | 6.0 | 1 | 0.000000 | 0.000004 | 0.0 | 1.0 | 0.000000 | 0.000005 | 0.000000 | 0.000000 | 0.000000 | inf |
| 7 | 7.0 | 2 | 0.000000 | 0.000007 | 0.0 | 2.0 | 0.000000 | 0.000009 | 0.000000 | 0.000000 | 0.000000 | inf |
| 8 | 8.0 | 1 | 1.000000 | 0.000004 | 1.0 | 0.0 | 0.000017 | 0.000000 | inf | 1.000000 | inf | inf |
plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
# Categories '0-2', '3-4', '>=5'
df_inputs_prepr['chargeoff_within_12_mths:0'] = np.where((df_inputs_prepr['chargeoff_within_12_mths'] == 0), 1, 0)
df_inputs_prepr['chargeoff_within_12_mths:1'] = np.where((df_inputs_prepr['chargeoff_within_12_mths'] == 1), 1, 0)
df_inputs_prepr['chargeoff_within_12_mths:>=2'] = np.where((df_inputs_prepr['chargeoff_within_12_mths'] >= 2), 1, 0)
Variable: 'total_acc'¶
# total_acc
df_inputs_prepr['total_acc_factor'] = pd.cut(df_inputs_prepr['total_acc'], 58)
# Here we do fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.
df_temp = woe_ordered_continuous(df_inputs_prepr, 'total_acc_factor', df_targets_prepr)
# We calculate weight of evidence.
df_temp
C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1025164655.py:4: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning. df = pd.concat([df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].count(), C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1025164655.py:5: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning. df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].mean()], axis = 1)
| total_acc_factor | n_obs | prop_good | prop_n_obs | n_good | n_bad | prop_n_good | prop_n_bad | WoE | diff_prop_good | diff_WoE | IV | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | (1.856, 4.483] | 1324 | 0.270393 | 0.004828 | 358.0 | 966.0 | 0.006088 | 0.004484 | 0.857713 | NaN | NaN | inf |
| 1 | (4.483, 6.966] | 3984 | 0.231928 | 0.014528 | 924.0 | 3060.0 | 0.015714 | 0.014204 | 0.744928 | 0.038465 | 0.112786 | inf |
| 2 | (6.966, 9.448] | 12089 | 0.231367 | 0.044083 | 2797.0 | 9292.0 | 0.047566 | 0.043132 | 0.743275 | 0.000560 | 0.001652 | inf |
| 3 | (9.448, 11.931] | 12378 | 0.224592 | 0.045137 | 2780.0 | 9598.0 | 0.047277 | 0.044552 | 0.723270 | 0.006775 | 0.020005 | inf |
| 4 | (11.931, 14.414] | 23709 | 0.229449 | 0.086455 | 5440.0 | 18269.0 | 0.092514 | 0.084802 | 0.737615 | 0.004857 | 0.014345 | inf |
| 5 | (14.414, 16.897] | 17954 | 0.223293 | 0.065470 | 4009.0 | 13945.0 | 0.068178 | 0.064730 | 0.719429 | 0.006156 | 0.018187 | inf |
| 6 | (16.897, 19.379] | 29452 | 0.216148 | 0.107397 | 6366.0 | 23086.0 | 0.108262 | 0.107161 | 0.698267 | 0.007145 | 0.021161 | inf |
| 7 | (19.379, 21.862] | 19656 | 0.212810 | 0.071676 | 4183.0 | 15473.0 | 0.071137 | 0.071823 | 0.688359 | 0.003338 | 0.009908 | inf |
| 8 | (21.862, 24.345] | 28961 | 0.211629 | 0.105607 | 6129.0 | 22832.0 | 0.104231 | 0.105982 | 0.684851 | 0.001181 | 0.003509 | inf |
| 9 | (24.345, 26.828] | 17727 | 0.207875 | 0.064642 | 3685.0 | 14042.0 | 0.062668 | 0.065181 | 0.673684 | 0.003754 | 0.011167 | inf |
| 10 | (26.828, 29.31] | 23717 | 0.204748 | 0.086485 | 4856.0 | 18861.0 | 0.082582 | 0.087550 | 0.664368 | 0.003127 | 0.009316 | inf |
| 11 | (29.31, 31.793] | 13501 | 0.205614 | 0.049232 | 2776.0 | 10725.0 | 0.047209 | 0.049784 | 0.666951 | 0.000867 | 0.002583 | inf |
| 12 | (31.793, 34.276] | 17189 | 0.202106 | 0.062680 | 3474.0 | 13715.0 | 0.059080 | 0.063663 | 0.656488 | 0.003508 | 0.010463 | inf |
| 13 | (34.276, 36.759] | 9448 | 0.205758 | 0.034452 | 1944.0 | 7504.0 | 0.033060 | 0.034832 | 0.667378 | 0.003652 | 0.010891 | inf |
| 14 | (36.759, 39.241] | 11538 | 0.209828 | 0.042074 | 2421.0 | 9117.0 | 0.041172 | 0.042320 | 0.679496 | 0.004071 | 0.012118 | inf |
| 15 | (39.241, 41.724] | 5976 | 0.214692 | 0.021792 | 1283.0 | 4693.0 | 0.021819 | 0.021784 | 0.693947 | 0.004864 | 0.014450 | inf |
| 16 | (41.724, 44.207] | 7226 | 0.199972 | 0.026350 | 1445.0 | 5781.0 | 0.024574 | 0.026834 | 0.650116 | 0.014720 | 0.043831 | inf |
| 17 | (44.207, 46.69] | 3663 | 0.204750 | 0.013357 | 750.0 | 2913.0 | 0.012755 | 0.013522 | 0.664375 | 0.004778 | 0.014259 | inf |
| 18 | (46.69, 49.172] | 4301 | 0.214369 | 0.015684 | 922.0 | 3379.0 | 0.015680 | 0.015685 | 0.692987 | 0.009619 | 0.028612 | inf |
| 19 | (49.172, 51.655] | 2232 | 0.216846 | 0.008139 | 484.0 | 1748.0 | 0.008231 | 0.008114 | 0.700336 | 0.002477 | 0.007349 | inf |
| 20 | (51.655, 54.138] | 2389 | 0.226036 | 0.008712 | 540.0 | 1849.0 | 0.009183 | 0.008583 | 0.727538 | 0.009190 | 0.027202 | inf |
| 21 | (54.138, 56.621] | 1191 | 0.205709 | 0.004343 | 245.0 | 946.0 | 0.004167 | 0.004391 | 0.667234 | 0.020327 | 0.060304 | inf |
| 22 | (56.621, 59.103] | 1380 | 0.217391 | 0.005032 | 300.0 | 1080.0 | 0.005102 | 0.005013 | 0.701953 | 0.011682 | 0.034719 | inf |
| 23 | (59.103, 61.586] | 720 | 0.227778 | 0.002625 | 164.0 | 556.0 | 0.002789 | 0.002581 | 0.732683 | 0.010386 | 0.030729 | inf |
| 24 | (61.586, 64.069] | 1054 | 0.183112 | 0.003843 | 193.0 | 861.0 | 0.003282 | 0.003997 | 0.599520 | 0.044666 | 0.133163 | inf |
| 25 | (64.069, 66.552] | 288 | 0.232639 | 0.001050 | 67.0 | 221.0 | 0.001139 | 0.001026 | 0.747024 | 0.049527 | 0.147504 | inf |
| 26 | (66.552, 69.034] | 331 | 0.211480 | 0.001207 | 70.0 | 261.0 | 0.001190 | 0.001212 | 0.684408 | 0.021159 | 0.062616 | inf |
| 27 | (69.034, 71.517] | 167 | 0.209581 | 0.000609 | 35.0 | 132.0 | 0.000595 | 0.000613 | 0.678760 | 0.001900 | 0.005648 | inf |
| 28 | (71.517, 74.0] | 183 | 0.234973 | 0.000667 | 43.0 | 140.0 | 0.000731 | 0.000650 | 0.753901 | 0.025392 | 0.075141 | inf |
| 29 | (74.0, 76.483] | 98 | 0.204082 | 0.000357 | 20.0 | 78.0 | 0.000340 | 0.000362 | 0.662382 | 0.030891 | 0.091519 | inf |
| 30 | (76.483, 78.966] | 72 | 0.208333 | 0.000263 | 15.0 | 57.0 | 0.000255 | 0.000265 | 0.675048 | 0.004252 | 0.012666 | inf |
| 31 | (78.966, 81.448] | 86 | 0.232558 | 0.000314 | 20.0 | 66.0 | 0.000340 | 0.000306 | 0.746786 | 0.024225 | 0.071738 | inf |
| 32 | (81.448, 83.931] | 47 | 0.255319 | 0.000171 | 12.0 | 35.0 | 0.000204 | 0.000162 | 0.813647 | 0.022761 | 0.066860 | inf |
| 33 | (83.931, 86.414] | 59 | 0.203390 | 0.000215 | 12.0 | 47.0 | 0.000204 | 0.000218 | 0.660319 | 0.051929 | 0.153328 | inf |
| 34 | (86.414, 88.897] | 26 | 0.307692 | 0.000095 | 8.0 | 18.0 | 0.000136 | 0.000084 | 0.966339 | 0.104302 | 0.306020 | inf |
| 35 | (88.897, 91.379] | 34 | 0.294118 | 0.000124 | 10.0 | 24.0 | 0.000170 | 0.000111 | 0.926849 | 0.013575 | 0.039490 | inf |
| 36 | (91.379, 93.862] | 12 | 0.250000 | 0.000044 | 3.0 | 9.0 | 0.000051 | 0.000042 | 0.798060 | 0.044118 | 0.128789 | inf |
| 37 | (93.862, 96.345] | 26 | 0.346154 | 0.000095 | 9.0 | 17.0 | 0.000153 | 0.000079 | 1.078273 | 0.096154 | 0.280212 | inf |
| 38 | (96.345, 98.828] | 5 | 0.400000 | 0.000018 | 2.0 | 3.0 | 0.000034 | 0.000014 | 1.236185 | 0.053846 | 0.157913 | inf |
| 39 | (98.828, 101.31] | 9 | 0.222222 | 0.000033 | 2.0 | 7.0 | 0.000034 | 0.000032 | 0.716262 | 0.177778 | 0.519924 | inf |
| 40 | (101.31, 103.793] | 3 | 0.000000 | 0.000011 | 0.0 | 3.0 | 0.000000 | 0.000014 | 0.000000 | 0.222222 | 0.716262 | inf |
| 41 | (103.793, 106.276] | 7 | 0.142857 | 0.000026 | 1.0 | 6.0 | 0.000017 | 0.000028 | 0.476616 | 0.142857 | 0.476616 | inf |
| 42 | (106.276, 108.759] | 1 | 0.000000 | 0.000004 | 0.0 | 1.0 | 0.000000 | 0.000005 | 0.000000 | 0.142857 | 0.476616 | inf |
| 43 | (108.759, 111.241] | 6 | 0.333333 | 0.000022 | 2.0 | 4.0 | 0.000034 | 0.000019 | 1.040928 | 0.333333 | 1.040928 | inf |
| 44 | (111.241, 113.724] | 3 | 0.000000 | 0.000011 | 0.0 | 3.0 | 0.000000 | 0.000014 | 0.000000 | 0.333333 | 1.040928 | inf |
| 45 | (113.724, 116.207] | 4 | 0.250000 | 0.000015 | 1.0 | 3.0 | 0.000017 | 0.000014 | 0.798060 | 0.250000 | 0.798060 | inf |
| 46 | (116.207, 118.69] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 47 | (118.69, 121.172] | 1 | 1.000000 | 0.000004 | 1.0 | 0.0 | 0.000017 | 0.000000 | inf | NaN | NaN | inf |
| 48 | (121.172, 123.655] | 2 | 0.000000 | 0.000007 | 0.0 | 2.0 | 0.000000 | 0.000009 | 0.000000 | 1.000000 | inf | inf |
| 49 | (123.655, 126.138] | 1 | 0.000000 | 0.000004 | 0.0 | 1.0 | 0.000000 | 0.000005 | 0.000000 | 0.000000 | 0.000000 | inf |
| 50 | (126.138, 128.621] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 51 | (128.621, 131.103] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 52 | (131.103, 133.586] | 1 | 0.000000 | 0.000004 | 0.0 | 1.0 | 0.000000 | 0.000005 | 0.000000 | NaN | NaN | inf |
| 53 | (133.586, 136.069] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 54 | (136.069, 138.552] | 1 | 1.000000 | 0.000004 | 1.0 | 0.0 | 0.000017 | 0.000000 | inf | NaN | NaN | inf |
| 55 | (138.552, 141.034] | 1 | 0.000000 | 0.000004 | 0.0 | 1.0 | 0.000000 | 0.000005 | 0.000000 | 1.000000 | inf | inf |
| 56 | (141.034, 143.517] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 57 | (143.517, 146.0] | 1 | 0.000000 | 0.000004 | 0.0 | 1.0 | 0.000000 | 0.000005 | 0.000000 | NaN | NaN | inf |
plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
# Categories: '<=20', '20-56', '>=57'
df_inputs_prepr['total_acc:<=20'] = np.where((df_inputs_prepr['total_acc'] <= 20), 1, 0)
df_inputs_prepr['total_acc:21-56'] = np.where((df_inputs_prepr['total_acc'] >= 21) & (df_inputs_prepr['total_acc'] <= 56), 1, 0)
df_inputs_prepr['total_acc:>=57'] = np.where((df_inputs_prepr['total_acc'] >= 57), 1, 0)
df_inputs_prepr = df_inputs_prepr.drop(columns = ['total_acc_factor'])
# Drop the provisory feature
Variable: 'delinq_amnt'¶
# unique values
df_inputs_prepr['delinq_amnt'].nunique()
# number of observations with 0 value
df_inputs_prepr['delinq_amnt'].value_counts()[0]
# 'delinq_amnt'
df_inputs_prepr['delinq_amnt_factor'] = pd.cut(df_inputs_prepr['delinq_amnt'], 50)
# Here we do fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.
# acc_now_delinq
df_temp = woe_ordered_continuous(df_inputs_prepr, 'delinq_amnt_factor', df_targets_prepr)
# We calculate weight of evidence.
df_temp
plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
# Categories: '0', '>=1'
df_inputs_prepr['delinq_amnt:0'] = np.where((df_inputs_prepr['delinq_amnt'] == 0), 1, 0)
df_inputs_prepr['delinq_amnt:>=1'] = np.where((df_inputs_prepr['delinq_amnt'] >= 1), 1, 0)
df_inputs_prepr = df_inputs_prepr.drop(columns = ['delinq_amnt_factor'])
# Drop the provisory feature
Variable: 'num_accts_ever_120_pd'¶
# unique values
df_inputs_prepr['num_accts_ever_120_pd'].unique()
array([ 0., 2., 1., 3., 4., 16., 7., 5., 18., 9., 11., 6., 13.,
12., 23., 10., 8., 15., 14., 26., 20., 34., 19., 17., 27., 24.,
22., 28., 25., 29., 21., 30.])
# 'num_accts_ever_120_pd'
df_temp = woe_ordered_continuous(df_inputs_prepr, 'num_accts_ever_120_pd', df_targets_prepr)
# We calculate weight of evidence.
df_temp
| num_accts_ever_120_pd | n_obs | prop_good | prop_n_obs | n_good | n_bad | prop_n_good | prop_n_bad | WoE | diff_prop_good | diff_WoE | IV | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 212322 | 0.209159 | 0.774237 | 44409.0 | 167913.0 | 0.755229 | 0.779425 | 0.677504 | NaN | NaN | inf |
| 1 | 1.0 | 32932 | 0.236335 | 0.120087 | 7783.0 | 25149.0 | 0.132359 | 0.116738 | 0.757914 | 0.027177 | 0.080410 | inf |
| 2 | 2.0 | 13319 | 0.228771 | 0.048568 | 3047.0 | 10272.0 | 0.051818 | 0.047681 | 0.735615 | 0.007565 | 0.022299 | inf |
| 3 | 3.0 | 6079 | 0.230137 | 0.022167 | 1399.0 | 4680.0 | 0.023792 | 0.021724 | 0.739645 | 0.001366 | 0.004030 | inf |
| 4 | 4.0 | 3578 | 0.229737 | 0.013047 | 822.0 | 2756.0 | 0.013979 | 0.012793 | 0.738467 | 0.000399 | 0.001178 | inf |
| 5 | 5.0 | 2173 | 0.218132 | 0.007924 | 474.0 | 1699.0 | 0.008061 | 0.007886 | 0.704148 | 0.011606 | 0.034319 | inf |
| 6 | 6.0 | 1368 | 0.230994 | 0.004988 | 316.0 | 1052.0 | 0.005374 | 0.004883 | 0.742175 | 0.012863 | 0.038027 | inf |
| 7 | 7.0 | 830 | 0.231325 | 0.003027 | 192.0 | 638.0 | 0.003265 | 0.002961 | 0.743151 | 0.000331 | 0.000977 | inf |
| 8 | 8.0 | 501 | 0.195609 | 0.001827 | 98.0 | 403.0 | 0.001667 | 0.001871 | 0.637064 | 0.035717 | 0.106087 | inf |
| 9 | 9.0 | 351 | 0.225071 | 0.001280 | 79.0 | 272.0 | 0.001343 | 0.001263 | 0.724687 | 0.029462 | 0.087623 | inf |
| 10 | 10.0 | 261 | 0.206897 | 0.000952 | 54.0 | 207.0 | 0.000918 | 0.000961 | 0.670771 | 0.018175 | 0.053916 | inf |
| 11 | 11.0 | 140 | 0.221429 | 0.000511 | 31.0 | 109.0 | 0.000527 | 0.000506 | 0.713913 | 0.014532 | 0.043142 | inf |
| 12 | 12.0 | 109 | 0.266055 | 0.000397 | 29.0 | 80.0 | 0.000493 | 0.000371 | 0.845046 | 0.044626 | 0.131134 | inf |
| 13 | 13.0 | 67 | 0.268657 | 0.000244 | 18.0 | 49.0 | 0.000306 | 0.000227 | 0.852645 | 0.002602 | 0.007599 | inf |
| 14 | 14.0 | 67 | 0.164179 | 0.000244 | 11.0 | 56.0 | 0.000187 | 0.000260 | 0.542122 | 0.104478 | 0.310523 | inf |
| 15 | 15.0 | 28 | 0.285714 | 0.000102 | 8.0 | 20.0 | 0.000136 | 0.000093 | 0.902384 | 0.121535 | 0.360262 | inf |
| 16 | 16.0 | 28 | 0.178571 | 0.000102 | 5.0 | 23.0 | 0.000085 | 0.000107 | 0.585814 | 0.107143 | 0.316570 | inf |
| 17 | 17.0 | 17 | 0.411765 | 0.000062 | 7.0 | 10.0 | 0.000119 | 0.000046 | 1.271046 | 0.233193 | 0.685232 | inf |
| 18 | 18.0 | 15 | 0.333333 | 0.000055 | 5.0 | 10.0 | 0.000085 | 0.000046 | 1.040928 | 0.078431 | 0.230119 | inf |
| 19 | 19.0 | 10 | 0.300000 | 0.000036 | 3.0 | 7.0 | 0.000051 | 0.000032 | 0.943965 | 0.033333 | 0.096963 | inf |
| 20 | 20.0 | 7 | 0.428571 | 0.000026 | 3.0 | 4.0 | 0.000051 | 0.000019 | 1.321159 | 0.128571 | 0.377195 | inf |
| 21 | 21.0 | 6 | 0.333333 | 0.000022 | 2.0 | 4.0 | 0.000034 | 0.000019 | 1.040928 | 0.095238 | 0.280232 | inf |
| 22 | 22.0 | 6 | 0.333333 | 0.000022 | 2.0 | 4.0 | 0.000034 | 0.000019 | 1.040928 | 0.000000 | 0.000000 | inf |
| 23 | 23.0 | 2 | 0.500000 | 0.000007 | 1.0 | 1.0 | 0.000017 | 0.000005 | 1.539806 | 0.166667 | 0.498878 | inf |
| 24 | 24.0 | 4 | 0.000000 | 0.000015 | 0.0 | 4.0 | 0.000000 | 0.000019 | 0.000000 | 0.500000 | 1.539806 | inf |
| 25 | 25.0 | 1 | 0.000000 | 0.000004 | 0.0 | 1.0 | 0.000000 | 0.000005 | 0.000000 | 0.000000 | 0.000000 | inf |
| 26 | 26.0 | 4 | 0.250000 | 0.000015 | 1.0 | 3.0 | 0.000017 | 0.000014 | 0.798060 | 0.250000 | 0.798060 | inf |
| 27 | 27.0 | 2 | 0.500000 | 0.000007 | 1.0 | 1.0 | 0.000017 | 0.000005 | 1.539806 | 0.250000 | 0.741746 | inf |
| 28 | 28.0 | 2 | 0.000000 | 0.000007 | 0.0 | 2.0 | 0.000000 | 0.000009 | 0.000000 | 0.500000 | 1.539806 | inf |
| 29 | 29.0 | 2 | 0.000000 | 0.000007 | 0.0 | 2.0 | 0.000000 | 0.000009 | 0.000000 | 0.000000 | 0.000000 | inf |
| 30 | 30.0 | 1 | 0.000000 | 0.000004 | 0.0 | 1.0 | 0.000000 | 0.000005 | 0.000000 | 0.000000 | 0.000000 | inf |
| 31 | 34.0 | 2 | 1.000000 | 0.000007 | 2.0 | 0.0 | 0.000034 | 0.000000 | inf | 1.000000 | inf | inf |
plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
# Categories: '0', '1-11', '>=12'
df_inputs_prepr['num_accts_ever_120_pd:0'] = np.where((df_inputs_prepr['num_accts_ever_120_pd'] == 0), 1, 0)
df_inputs_prepr['num_accts_ever_120_pd:1-11'] = np.where((df_inputs_prepr['num_accts_ever_120_pd'] >= 1) & (df_inputs_prepr['num_accts_ever_120_pd'] <= 11), 1, 0)
df_inputs_prepr['num_accts_ever_120_pd:>=12'] = np.where((df_inputs_prepr['num_accts_ever_120_pd'] >= 12), 1, 0)
Variable: 'num_tl_90g_dpd_24m'¶
# unique values
df_inputs_prepr['num_tl_90g_dpd_24m'].unique()
array([ 0., 1., 2., 4., 13., 3., 9., 6., 14., 5., 7., 8., 11.,
10., 12., 18., 15., 20., 16., 36., 26., 22., 24., 17.])
# 'num_tl_90g_dpd_24m'
df_temp = woe_ordered_continuous(df_inputs_prepr, 'num_tl_90g_dpd_24m', df_targets_prepr)
# We calculate weight of evidence.
df_temp
| num_tl_90g_dpd_24m | n_obs | prop_good | prop_n_obs | n_good | n_bad | prop_n_good | prop_n_bad | WoE | diff_prop_good | diff_WoE | IV | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 259142 | 0.212849 | 0.944967 | 55158.0 | 203984.0 | 0.938029 | 0.946860 | 0.688473 | NaN | NaN | inf |
| 1 | 1.0 | 11266 | 0.242233 | 0.041082 | 2729.0 | 8537.0 | 0.046410 | 0.039627 | 0.775262 | 0.029385 | 0.086789 | inf |
| 2 | 2.0 | 2233 | 0.248992 | 0.008143 | 556.0 | 1677.0 | 0.009455 | 0.007784 | 0.795105 | 0.006759 | 0.019844 | inf |
| 3 | 3.0 | 623 | 0.240770 | 0.002272 | 150.0 | 473.0 | 0.002551 | 0.002196 | 0.770962 | 0.008222 | 0.024143 | inf |
| 4 | 4.0 | 349 | 0.217765 | 0.001273 | 76.0 | 273.0 | 0.001292 | 0.001267 | 0.703061 | 0.023005 | 0.067901 | inf |
| 5 | 5.0 | 199 | 0.190955 | 0.000726 | 38.0 | 161.0 | 0.000646 | 0.000747 | 0.623111 | 0.026810 | 0.079950 | inf |
| 6 | 6.0 | 147 | 0.170068 | 0.000536 | 25.0 | 122.0 | 0.000425 | 0.000566 | 0.560047 | 0.020887 | 0.063064 | inf |
| 7 | 7.0 | 71 | 0.281690 | 0.000259 | 20.0 | 51.0 | 0.000340 | 0.000237 | 0.890661 | 0.111622 | 0.330614 | inf |
| 8 | 8.0 | 45 | 0.266667 | 0.000164 | 12.0 | 33.0 | 0.000204 | 0.000153 | 0.846833 | 0.015023 | 0.043828 | inf |
| 9 | 9.0 | 53 | 0.169811 | 0.000193 | 9.0 | 44.0 | 0.000153 | 0.000204 | 0.559267 | 0.096855 | 0.287566 | inf |
| 10 | 10.0 | 29 | 0.241379 | 0.000106 | 7.0 | 22.0 | 0.000119 | 0.000102 | 0.772752 | 0.071568 | 0.213485 | inf |
| 11 | 11.0 | 15 | 0.333333 | 0.000055 | 5.0 | 10.0 | 0.000085 | 0.000046 | 1.040928 | 0.091954 | 0.268176 | inf |
| 12 | 12.0 | 17 | 0.176471 | 0.000062 | 3.0 | 14.0 | 0.000051 | 0.000065 | 0.579461 | 0.156863 | 0.461467 | inf |
| 13 | 13.0 | 14 | 0.214286 | 0.000051 | 3.0 | 11.0 | 0.000051 | 0.000051 | 0.692740 | 0.037815 | 0.113280 | inf |
| 14 | 14.0 | 12 | 0.500000 | 0.000044 | 6.0 | 6.0 | 0.000102 | 0.000028 | 1.539806 | 0.285714 | 0.847065 | inf |
| 15 | 15.0 | 5 | 0.200000 | 0.000018 | 1.0 | 4.0 | 0.000017 | 0.000019 | 0.650199 | 0.300000 | 0.889607 | inf |
| 16 | 16.0 | 3 | 0.333333 | 0.000011 | 1.0 | 2.0 | 0.000017 | 0.000009 | 1.040928 | 0.133333 | 0.390729 | inf |
| 17 | 17.0 | 1 | 0.000000 | 0.000004 | 0.0 | 1.0 | 0.000000 | 0.000005 | 0.000000 | 0.333333 | 1.040928 | inf |
| 18 | 18.0 | 2 | 0.000000 | 0.000007 | 0.0 | 2.0 | 0.000000 | 0.000009 | 0.000000 | 0.000000 | 0.000000 | inf |
| 19 | 20.0 | 3 | 0.333333 | 0.000011 | 1.0 | 2.0 | 0.000017 | 0.000009 | 1.040928 | 0.333333 | 1.040928 | inf |
| 20 | 22.0 | 1 | 0.000000 | 0.000004 | 0.0 | 1.0 | 0.000000 | 0.000005 | 0.000000 | 0.333333 | 1.040928 | inf |
| 21 | 24.0 | 1 | 0.000000 | 0.000004 | 0.0 | 1.0 | 0.000000 | 0.000005 | 0.000000 | 0.000000 | 0.000000 | inf |
| 22 | 26.0 | 2 | 0.500000 | 0.000007 | 1.0 | 1.0 | 0.000017 | 0.000005 | 1.539806 | 0.500000 | 1.539806 | inf |
| 23 | 36.0 | 1 | 1.000000 | 0.000004 | 1.0 | 0.0 | 0.000017 | 0.000000 | inf | 0.500000 | inf | inf |
plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
# Categories: '0', '1-4', '>=5'
df_inputs_prepr['num_tl_90g_dpd_24m:0'] = np.where((df_inputs_prepr['num_tl_90g_dpd_24m'] == 0), 1, 0)
df_inputs_prepr['num_tl_90g_dpd_24m:1-4'] = np.where((df_inputs_prepr['num_tl_90g_dpd_24m'] >= 1) & (df_inputs_prepr['num_tl_90g_dpd_24m'] <= 4), 1, 0)
df_inputs_prepr['num_tl_90g_dpd_24m:>=5'] = np.where((df_inputs_prepr['num_tl_90g_dpd_24m'] >= 5), 1, 0)
Variable: 'revol_bal'¶
# unique values
df_inputs_prepr['revol_bal'].unique()
array([ 11405., 30808., 16940., ..., 87095., 155670., 34577.])
# 'revol_bal'
df_inputs_prepr['revol_bal_factor'] = pd.cut(df_inputs_prepr['revol_bal'], 50)
# Here we do fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.
# 'revol_bal'
df_temp = woe_ordered_continuous(df_inputs_prepr, 'revol_bal_factor', df_targets_prepr)
# We calculate weight of evidence.
df_temp
C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1025164655.py:4: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning. df = pd.concat([df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].count(), C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1025164655.py:5: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning. df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].mean()], axis = 1)
| revol_bal_factor | n_obs | prop_good | prop_n_obs | n_good | n_bad | prop_n_good | prop_n_bad | WoE | diff_prop_good | diff_WoE | IV | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | (-2904.836, 58096.72] | 267276 | 0.215605 | 0.974628 | 57626.0 | 209650.0 | 0.980001 | 0.973161 | 0.696655 | NaN | NaN | 0.001207 |
| 1 | (58096.72, 116193.44] | 5128 | 0.177457 | 0.018699 | 910.0 | 4218.0 | 0.015476 | 0.019579 | 0.582445 | 0.038148 | 0.114210 | 0.001207 |
| 2 | (116193.44, 174290.16] | 1030 | 0.151456 | 0.003756 | 156.0 | 874.0 | 0.002653 | 0.004057 | 0.503154 | 0.026001 | 0.079291 | 0.001207 |
| 3 | (174290.16, 232386.88] | 402 | 0.131841 | 0.001466 | 53.0 | 349.0 | 0.000901 | 0.001620 | 0.442360 | 0.019616 | 0.060794 | 0.001207 |
| 4 | (232386.88, 290483.6] | 198 | 0.111111 | 0.000722 | 22.0 | 176.0 | 0.000374 | 0.000817 | 0.377039 | 0.020730 | 0.065322 | 0.001207 |
| 5 | (290483.6, 348580.32] | 86 | 0.139535 | 0.000314 | 12.0 | 74.0 | 0.000204 | 0.000343 | 0.466316 | 0.028424 | 0.089278 | 0.001207 |
| 6 | (348580.32, 406677.04] | 45 | 0.177778 | 0.000164 | 8.0 | 37.0 | 0.000136 | 0.000172 | 0.583415 | 0.038243 | 0.117099 | 0.001207 |
| 7 | (406677.04, 464773.76] | 29 | 0.275862 | 0.000106 | 8.0 | 21.0 | 0.000136 | 0.000097 | 0.873671 | 0.098084 | 0.290256 | 0.001207 |
| 8 | (464773.76, 522870.48] | 10 | 0.300000 | 0.000036 | 3.0 | 7.0 | 0.000051 | 0.000032 | 0.943965 | 0.024138 | 0.070293 | 0.001207 |
| 9 | (522870.48, 580967.2] | 11 | 0.181818 | 0.000040 | 2.0 | 9.0 | 0.000034 | 0.000042 | 0.595618 | 0.118182 | 0.348346 | 0.001207 |
| 10 | (580967.2, 639063.92] | 6 | 0.000000 | 0.000022 | 0.0 | 6.0 | 0.000000 | 0.000028 | 0.000000 | 0.181818 | 0.595618 | 0.001207 |
| 11 | (639063.92, 697160.64] | 3 | 0.000000 | 0.000011 | 0.0 | 3.0 | 0.000000 | 0.000014 | 0.000000 | 0.000000 | 0.000000 | 0.001207 |
| 12 | (697160.64, 755257.36] | 2 | 0.000000 | 0.000007 | 0.0 | 2.0 | 0.000000 | 0.000009 | 0.000000 | 0.000000 | 0.000000 | 0.001207 |
| 13 | (755257.36, 813354.08] | 3 | 0.666667 | 0.000011 | 2.0 | 1.0 | 0.000034 | 0.000005 | 2.119548 | 0.666667 | 2.119548 | 0.001207 |
| 14 | (813354.08, 871450.8] | 1 | 0.000000 | 0.000004 | 0.0 | 1.0 | 0.000000 | 0.000005 | 0.000000 | 0.666667 | 2.119548 | 0.001207 |
| 15 | (871450.8, 929547.52] | 1 | 0.000000 | 0.000004 | 0.0 | 1.0 | 0.000000 | 0.000005 | 0.000000 | 0.000000 | 0.000000 | 0.001207 |
| 16 | (929547.52, 987644.24] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.001207 |
| 17 | (987644.24, 1045740.96] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.001207 |
| 18 | (1045740.96, 1103837.68] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.001207 |
| 19 | (1103837.68, 1161934.4] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.001207 |
| 20 | (1161934.4, 1220031.12] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.001207 |
| 21 | (1220031.12, 1278127.84] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.001207 |
| 22 | (1278127.84, 1336224.56] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.001207 |
| 23 | (1336224.56, 1394321.28] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.001207 |
| 24 | (1394321.28, 1452418.0] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.001207 |
| 25 | (1452418.0, 1510514.72] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.001207 |
| 26 | (1510514.72, 1568611.44] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.001207 |
| 27 | (1568611.44, 1626708.16] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.001207 |
| 28 | (1626708.16, 1684804.88] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.001207 |
| 29 | (1684804.88, 1742901.6] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.001207 |
| 30 | (1742901.6, 1800998.32] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.001207 |
| 31 | (1800998.32, 1859095.04] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.001207 |
| 32 | (1859095.04, 1917191.76] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.001207 |
| 33 | (1917191.76, 1975288.48] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.001207 |
| 34 | (1975288.48, 2033385.2] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.001207 |
| 35 | (2033385.2, 2091481.92] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.001207 |
| 36 | (2091481.92, 2149578.64] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.001207 |
| 37 | (2149578.64, 2207675.36] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.001207 |
| 38 | (2207675.36, 2265772.08] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.001207 |
| 39 | (2265772.08, 2323868.8] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.001207 |
| 40 | (2323868.8, 2381965.52] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.001207 |
| 41 | (2381965.52, 2440062.24] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.001207 |
| 42 | (2440062.24, 2498158.96] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.001207 |
| 43 | (2498158.96, 2556255.68] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.001207 |
| 44 | (2556255.68, 2614352.4] | 2 | 0.000000 | 0.000007 | 0.0 | 2.0 | 0.000000 | 0.000009 | 0.000000 | NaN | NaN | 0.001207 |
| 45 | (2614352.4, 2672449.12] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.001207 |
| 46 | (2672449.12, 2730545.84] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.001207 |
| 47 | (2730545.84, 2788642.56] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.001207 |
| 48 | (2788642.56, 2846739.28] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.001207 |
| 49 | (2846739.28, 2904836.0] | 1 | 0.000000 | 0.000004 | 0.0 | 1.0 | 0.000000 | 0.000005 | 0.000000 | NaN | NaN | 0.001207 |
plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
# one category will be created for 'revol_bal' > 100000.
#***********************************************************************************************
# the categories of everyone with 'revol_bal' less 100000.
df_inputs_prepr_temp = df_inputs_prepr.loc[df_inputs_prepr['revol_bal'] <= 100000, : ]
#df_inputs_prepr_temp
df_inputs_prepr_temp['revol_bal_factor'] = pd.cut(df_inputs_prepr_temp['revol_bal'], 50)
# Here we do fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.
df_temp = woe_ordered_continuous(df_inputs_prepr_temp, 'revol_bal_factor', df_targets_prepr[df_inputs_prepr_temp.index])
# We calculate weight of evidence.
df_temp
C:\Users\pc\AppData\Local\Temp\ipykernel_6728\3812317764.py:7: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy df_inputs_prepr_temp['revol_bal_factor'] = pd.cut(df_inputs_prepr_temp['revol_bal'], 50) C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1025164655.py:4: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning. df = pd.concat([df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].count(), C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1025164655.py:5: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning. df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].mean()], axis = 1)
| revol_bal_factor | n_obs | prop_good | prop_n_obs | n_good | n_bad | prop_n_good | prop_n_bad | WoE | diff_prop_good | diff_WoE | IV | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | (-99.967, 1999.34] | 16936 | 0.217820 | 0.062315 | 3689.0 | 13247.0 | 0.063122 | 0.062094 | 0.701395 | NaN | NaN | 0.001701 |
| 1 | (1999.34, 3998.68] | 23741 | 0.212839 | 0.087354 | 5053.0 | 18688.0 | 0.086462 | 0.087598 | 0.686640 | 0.004981 | 0.014754 | 0.001701 |
| 2 | (3998.68, 5998.02] | 28965 | 0.211289 | 0.106575 | 6120.0 | 22845.0 | 0.104719 | 0.107084 | 0.682046 | 0.001549 | 0.004594 | 0.001701 |
| 3 | (5998.02, 7997.36] | 28485 | 0.215025 | 0.104809 | 6125.0 | 22360.0 | 0.104805 | 0.104810 | 0.693121 | 0.003736 | 0.011075 | 0.001701 |
| 4 | (7997.36, 9996.7] | 25966 | 0.219749 | 0.095541 | 5706.0 | 20260.0 | 0.097635 | 0.094967 | 0.707100 | 0.004723 | 0.013978 | 0.001701 |
| 5 | (9996.7, 11996.04] | 22836 | 0.222762 | 0.084024 | 5087.0 | 17749.0 | 0.087044 | 0.083197 | 0.716004 | 0.003013 | 0.008904 | 0.001701 |
| 6 | (11996.04, 13995.38] | 19075 | 0.220708 | 0.070185 | 4210.0 | 14865.0 | 0.072037 | 0.069678 | 0.709934 | 0.002055 | 0.006070 | 0.001701 |
| 7 | (13995.38, 15994.72] | 16227 | 0.225057 | 0.059706 | 3652.0 | 12575.0 | 0.062489 | 0.058944 | 0.722777 | 0.004349 | 0.012843 | 0.001701 |
| 8 | (15994.72, 17994.06] | 13461 | 0.224203 | 0.049529 | 3018.0 | 10443.0 | 0.051641 | 0.048950 | 0.720258 | 0.000854 | 0.002519 | 0.001701 |
| 9 | (17994.06, 19993.4] | 11389 | 0.219773 | 0.041905 | 2503.0 | 8886.0 | 0.042829 | 0.041652 | 0.707172 | 0.004430 | 0.013086 | 0.001701 |
| 10 | (19993.4, 21992.74] | 9359 | 0.213698 | 0.034436 | 2000.0 | 7359.0 | 0.034222 | 0.034495 | 0.689188 | 0.006075 | 0.017984 | 0.001701 |
| 11 | (21992.74, 23992.08] | 7979 | 0.212683 | 0.029358 | 1697.0 | 6282.0 | 0.029037 | 0.029446 | 0.686180 | 0.001015 | 0.003008 | 0.001701 |
| 12 | (23992.08, 25991.42] | 6614 | 0.217569 | 0.024336 | 1439.0 | 5175.0 | 0.024623 | 0.024257 | 0.700651 | 0.004885 | 0.014471 | 0.001701 |
| 13 | (25991.42, 27990.76] | 5734 | 0.206836 | 0.021098 | 1186.0 | 4548.0 | 0.020294 | 0.021318 | 0.668821 | 0.010732 | 0.031830 | 0.001701 |
| 14 | (27990.76, 29990.1] | 4853 | 0.208325 | 0.017856 | 1011.0 | 3842.0 | 0.017299 | 0.018009 | 0.673244 | 0.001488 | 0.004423 | 0.001701 |
| 15 | (29990.1, 31989.44] | 4215 | 0.207117 | 0.015509 | 873.0 | 3342.0 | 0.014938 | 0.015665 | 0.669657 | 0.001207 | 0.003588 | 0.001701 |
| 16 | (31989.44, 33988.78] | 3551 | 0.193467 | 0.013066 | 687.0 | 2864.0 | 0.011755 | 0.013425 | 0.628951 | 0.013651 | 0.040705 | 0.001701 |
| 17 | (33988.78, 35988.12] | 3121 | 0.210189 | 0.011484 | 656.0 | 2465.0 | 0.011225 | 0.011554 | 0.678780 | 0.016722 | 0.049829 | 0.001701 |
| 18 | (35988.12, 37987.46] | 2619 | 0.195494 | 0.009636 | 512.0 | 2107.0 | 0.008761 | 0.009876 | 0.635015 | 0.014695 | 0.043765 | 0.001701 |
| 19 | (37987.46, 39986.8] | 2186 | 0.208600 | 0.008043 | 456.0 | 1730.0 | 0.007803 | 0.008109 | 0.674062 | 0.013106 | 0.039047 | 0.001701 |
| 20 | (39986.8, 41986.14] | 1907 | 0.202412 | 0.007017 | 386.0 | 1521.0 | 0.006605 | 0.007130 | 0.655656 | 0.006188 | 0.018406 | 0.001701 |
| 21 | (41986.14, 43985.48] | 1626 | 0.178352 | 0.005983 | 290.0 | 1336.0 | 0.004962 | 0.006262 | 0.583546 | 0.024060 | 0.072110 | 0.001701 |
| 22 | (43985.48, 45984.82] | 1421 | 0.192118 | 0.005228 | 273.0 | 1148.0 | 0.004671 | 0.005381 | 0.624916 | 0.013766 | 0.041370 | 0.001701 |
| 23 | (45984.82, 47984.16] | 1257 | 0.171838 | 0.004625 | 216.0 | 1041.0 | 0.003696 | 0.004880 | 0.563856 | 0.020281 | 0.061059 | 0.001701 |
| 24 | (47984.16, 49983.5] | 1083 | 0.228994 | 0.003985 | 248.0 | 835.0 | 0.004244 | 0.003914 | 0.734384 | 0.057156 | 0.170528 | 0.001701 |
| 25 | (49983.5, 51982.84] | 815 | 0.218405 | 0.002999 | 178.0 | 637.0 | 0.003046 | 0.002986 | 0.703125 | 0.010589 | 0.031259 | 0.001701 |
| 26 | (51982.84, 53982.18] | 698 | 0.206304 | 0.002568 | 144.0 | 554.0 | 0.002464 | 0.002597 | 0.667238 | 0.012101 | 0.035887 | 0.001701 |
| 27 | (53982.18, 55981.52] | 608 | 0.177632 | 0.002237 | 108.0 | 500.0 | 0.001848 | 0.002344 | 0.581372 | 0.028672 | 0.085865 | 0.001701 |
| 28 | (55981.52, 57980.86] | 515 | 0.182524 | 0.001895 | 94.0 | 421.0 | 0.001608 | 0.001973 | 0.596118 | 0.004893 | 0.014745 | 0.001701 |
| 29 | (57980.86, 59980.2] | 455 | 0.184615 | 0.001674 | 84.0 | 371.0 | 0.001437 | 0.001739 | 0.602407 | 0.002091 | 0.006290 | 0.001701 |
| 30 | (59980.2, 61979.54] | 394 | 0.190355 | 0.001450 | 75.0 | 319.0 | 0.001283 | 0.001495 | 0.619635 | 0.005740 | 0.017228 | 0.001701 |
| 31 | (61979.54, 63978.88] | 379 | 0.195251 | 0.001395 | 74.0 | 305.0 | 0.001266 | 0.001430 | 0.634287 | 0.004895 | 0.014651 | 0.001701 |
| 32 | (63978.88, 65978.22] | 351 | 0.170940 | 0.001291 | 60.0 | 291.0 | 0.001027 | 0.001364 | 0.561137 | 0.024310 | 0.073149 | 0.001701 |
| 33 | (65978.22, 67977.56] | 296 | 0.199324 | 0.001089 | 59.0 | 237.0 | 0.001010 | 0.001111 | 0.646451 | 0.028384 | 0.085314 | 0.001701 |
| 34 | (67977.56, 69976.9] | 269 | 0.226766 | 0.000990 | 61.0 | 208.0 | 0.001044 | 0.000975 | 0.727817 | 0.027441 | 0.081366 | 0.001701 |
| 35 | (69976.9, 71976.24] | 261 | 0.176245 | 0.000960 | 46.0 | 215.0 | 0.000787 | 0.001008 | 0.577187 | 0.050521 | 0.150631 | 0.001701 |
| 36 | (71976.24, 73975.58] | 185 | 0.140541 | 0.000681 | 26.0 | 159.0 | 0.000445 | 0.000745 | 0.468080 | 0.035705 | 0.109107 | 0.001701 |
| 37 | (73975.58, 75974.92] | 251 | 0.187251 | 0.000924 | 47.0 | 204.0 | 0.000804 | 0.000956 | 0.610325 | 0.046710 | 0.142245 | 0.001701 |
| 38 | (75974.92, 77974.26] | 187 | 0.219251 | 0.000688 | 41.0 | 146.0 | 0.000702 | 0.000684 | 0.705628 | 0.032000 | 0.095304 | 0.001701 |
| 39 | (77974.26, 79973.6] | 187 | 0.155080 | 0.000688 | 29.0 | 158.0 | 0.000496 | 0.000741 | 0.512832 | 0.064171 | 0.192796 | 0.001701 |
| 40 | (79973.6, 81972.94] | 157 | 0.184713 | 0.000578 | 29.0 | 128.0 | 0.000496 | 0.000600 | 0.602702 | 0.029633 | 0.089870 | 0.001701 |
| 41 | (81972.94, 83972.28] | 183 | 0.213115 | 0.000673 | 39.0 | 144.0 | 0.000667 | 0.000675 | 0.687459 | 0.028401 | 0.084757 | 0.001701 |
| 42 | (83972.28, 85971.62] | 131 | 0.091603 | 0.000482 | 12.0 | 119.0 | 0.000205 | 0.000558 | 0.313430 | 0.121512 | 0.374029 | 0.001701 |
| 43 | (85971.62, 87970.96] | 172 | 0.174419 | 0.000633 | 30.0 | 142.0 | 0.000513 | 0.000666 | 0.571666 | 0.082816 | 0.258236 | 0.001701 |
| 44 | (87970.96, 89970.3] | 112 | 0.178571 | 0.000412 | 20.0 | 92.0 | 0.000342 | 0.000431 | 0.584208 | 0.004153 | 0.012542 | 0.001701 |
| 45 | (89970.3, 91969.64] | 131 | 0.213740 | 0.000482 | 28.0 | 103.0 | 0.000479 | 0.000483 | 0.689314 | 0.035169 | 0.105106 | 0.001701 |
| 46 | (91969.64, 93968.98] | 116 | 0.155172 | 0.000427 | 18.0 | 98.0 | 0.000308 | 0.000459 | 0.513114 | 0.058568 | 0.176199 | 0.001701 |
| 47 | (93968.98, 95968.32] | 105 | 0.171429 | 0.000386 | 18.0 | 87.0 | 0.000308 | 0.000408 | 0.562617 | 0.016256 | 0.049502 | 0.001701 |
| 48 | (95968.32, 97967.66] | 118 | 0.169492 | 0.000434 | 20.0 | 98.0 | 0.000342 | 0.000459 | 0.556746 | 0.001937 | 0.005871 | 0.001701 |
| 49 | (97967.66, 99967.0] | 98 | 0.091837 | 0.000361 | 9.0 | 89.0 | 0.000154 | 0.000417 | 0.314186 | 0.077655 | 0.242560 | 0.001701 |
plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
# Categories: '<= 8000', '8000-22000', '22000-35000', '35000-60000', '60000-100000', > 100000
df_inputs_prepr['revol_bal:<=8k'] = np.where((df_inputs_prepr['revol_bal'] <= 8000.), 1, 0)
df_inputs_prepr['revol_bal:8-22k'] = np.where((df_inputs_prepr['revol_bal'] > 8000.) & (df_inputs_prepr['revol_bal'] <= 22000.), 1, 0)
df_inputs_prepr['revol_bal:22-35k'] = np.where((df_inputs_prepr['revol_bal'] > 22000.) & (df_inputs_prepr['revol_bal'] <= 35000.), 1, 0)
df_inputs_prepr['revol_bal:35-60k'] = np.where((df_inputs_prepr['revol_bal'] > 35000.) & (df_inputs_prepr['revol_bal'] <= 60000.), 1, 0)
df_inputs_prepr['revol_bal:60-100k'] = np.where((df_inputs_prepr['revol_bal'] > 60000.) & (df_inputs_prepr['revol_bal'] <= 100000.), 1, 0)
df_inputs_prepr['revol_bal:>100k'] = np.where((df_inputs_prepr['revol_bal'] > 100000.), 1, 0)
C:\Users\pc\AppData\Local\Temp\ipykernel_6728\3187895681.py:4: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()` df_inputs_prepr['revol_bal:22-35k'] = np.where((df_inputs_prepr['revol_bal'] > 22000.) & (df_inputs_prepr['revol_bal'] <= 35000.), 1, 0) C:\Users\pc\AppData\Local\Temp\ipykernel_6728\3187895681.py:5: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()` df_inputs_prepr['revol_bal:35-60k'] = np.where((df_inputs_prepr['revol_bal'] > 35000.) & (df_inputs_prepr['revol_bal'] <= 60000.), 1, 0) C:\Users\pc\AppData\Local\Temp\ipykernel_6728\3187895681.py:6: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()` df_inputs_prepr['revol_bal:60-100k'] = np.where((df_inputs_prepr['revol_bal'] > 60000.) & (df_inputs_prepr['revol_bal'] <= 100000.), 1, 0) C:\Users\pc\AppData\Local\Temp\ipykernel_6728\3187895681.py:7: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()` df_inputs_prepr['revol_bal:>100k'] = np.where((df_inputs_prepr['revol_bal'] > 100000.), 1, 0)
df_inputs_prepr = df_inputs_prepr.copy()
df_inputs_prepr = df_inputs_prepr.drop(columns = ['revol_bal_factor'])
# Drop the provisory feature
Variable: 'total_bal_il'¶
# unique values
df_inputs_prepr['total_bal_il'].unique()
array([ 0., 61045., 7321., ..., 72775., 108273., 919.])
# number of observations with 0 value
df_inputs_prepr['total_bal_il'].value_counts()[0]
173265
# 'revol_bal'
df_inputs_prepr['total_bal_il_factor'] = pd.cut(df_inputs_prepr['total_bal_il'], 50)
# Here we do fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.
# 'revol_bal'
df_temp = woe_ordered_continuous(df_inputs_prepr, 'total_bal_il_factor', df_targets_prepr)
# We calculate weight of evidence.
df_temp
C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1025164655.py:4: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning. df = pd.concat([df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].count(), C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1025164655.py:5: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning. df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].mean()], axis = 1)
| total_bal_il_factor | n_obs | prop_good | prop_n_obs | n_good | n_bad | prop_n_good | prop_n_bad | WoE | diff_prop_good | diff_WoE | IV | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | (-1044.916, 20898.32] | 212152 | 0.202082 | 0.773617 | 42872.0 | 169280.0 | 0.729091 | 0.785770 | 0.656415 | NaN | NaN | 0.009611 |
| 1 | (20898.32, 41796.64] | 29386 | 0.261077 | 0.107157 | 7672.0 | 21714.0 | 0.130472 | 0.100793 | 0.830495 | 0.058995 | 0.174080 | 0.009611 |
| 2 | (41796.64, 62694.96] | 14531 | 0.259032 | 0.052988 | 3764.0 | 10767.0 | 0.064011 | 0.049979 | 0.824516 | 0.002044 | 0.005980 | 0.009611 |
| 3 | (62694.96, 83593.28] | 7465 | 0.255325 | 0.027221 | 1906.0 | 5559.0 | 0.032414 | 0.025804 | 0.813663 | 0.003708 | 0.010852 | 0.009611 |
| 4 | (83593.28, 104491.6] | 4013 | 0.238973 | 0.014633 | 959.0 | 3054.0 | 0.016309 | 0.014176 | 0.765677 | 0.016352 | 0.047986 | 0.009611 |
| 5 | (104491.6, 125389.92] | 2247 | 0.252781 | 0.008194 | 568.0 | 1679.0 | 0.009660 | 0.007794 | 0.806213 | 0.013808 | 0.040536 | 0.009611 |
| 6 | (125389.92, 146288.24] | 1410 | 0.248936 | 0.005142 | 351.0 | 1059.0 | 0.005969 | 0.004916 | 0.794940 | 0.003845 | 0.011273 | 0.009611 |
| 7 | (146288.24, 167186.56] | 930 | 0.252688 | 0.003391 | 235.0 | 695.0 | 0.003996 | 0.003226 | 0.805940 | 0.003752 | 0.011000 | 0.009611 |
| 8 | (167186.56, 188084.88] | 612 | 0.215686 | 0.002232 | 132.0 | 480.0 | 0.002245 | 0.002228 | 0.696897 | 0.037002 | 0.109043 | 0.009611 |
| 9 | (188084.88, 208983.2] | 411 | 0.243309 | 0.001499 | 100.0 | 311.0 | 0.001701 | 0.001444 | 0.778423 | 0.027623 | 0.081526 | 0.009611 |
| 10 | (208983.2, 229881.52] | 276 | 0.260870 | 0.001006 | 72.0 | 204.0 | 0.001224 | 0.000947 | 0.829889 | 0.017561 | 0.051467 | 0.009611 |
| 11 | (229881.52, 250779.84] | 179 | 0.268156 | 0.000653 | 48.0 | 131.0 | 0.000816 | 0.000608 | 0.851184 | 0.007287 | 0.021295 | 0.009611 |
| 12 | (250779.84, 271678.16] | 160 | 0.231250 | 0.000583 | 37.0 | 123.0 | 0.000629 | 0.000571 | 0.742929 | 0.036906 | 0.108255 | 0.009611 |
| 13 | (271678.16, 292576.48] | 100 | 0.180000 | 0.000365 | 18.0 | 82.0 | 0.000306 | 0.000381 | 0.590130 | 0.051250 | 0.152799 | 0.009611 |
| 14 | (292576.48, 313474.8] | 87 | 0.160920 | 0.000317 | 14.0 | 73.0 | 0.000238 | 0.000339 | 0.532171 | 0.019080 | 0.057959 | 0.009611 |
| 15 | (313474.8, 334373.12] | 54 | 0.129630 | 0.000197 | 7.0 | 47.0 | 0.000119 | 0.000218 | 0.435448 | 0.031290 | 0.096723 | 0.009611 |
| 16 | (334373.12, 355271.44] | 48 | 0.229167 | 0.000175 | 11.0 | 37.0 | 0.000187 | 0.000172 | 0.736783 | 0.099537 | 0.301335 | 0.009611 |
| 17 | (355271.44, 376169.76] | 33 | 0.212121 | 0.000120 | 7.0 | 26.0 | 0.000119 | 0.000121 | 0.686312 | 0.017045 | 0.050471 | 0.009611 |
| 18 | (376169.76, 397068.08] | 29 | 0.206897 | 0.000106 | 6.0 | 23.0 | 0.000102 | 0.000107 | 0.670771 | 0.005225 | 0.015542 | 0.009611 |
| 19 | (397068.08, 417966.4] | 15 | 0.200000 | 0.000055 | 3.0 | 12.0 | 0.000051 | 0.000056 | 0.650199 | 0.006897 | 0.020572 | 0.009611 |
| 20 | (417966.4, 438864.72] | 25 | 0.160000 | 0.000091 | 4.0 | 21.0 | 0.000068 | 0.000097 | 0.529360 | 0.040000 | 0.120839 | 0.009611 |
| 21 | (438864.72, 459763.04] | 11 | 0.272727 | 0.000040 | 3.0 | 8.0 | 0.000051 | 0.000037 | 0.864527 | 0.112727 | 0.335167 | 0.009611 |
| 22 | (459763.04, 480661.36] | 8 | 0.375000 | 0.000029 | 3.0 | 5.0 | 0.000051 | 0.000023 | 1.162592 | 0.102273 | 0.298065 | 0.009611 |
| 23 | (480661.36, 501559.68] | 10 | 0.100000 | 0.000036 | 1.0 | 9.0 | 0.000017 | 0.000042 | 0.341514 | 0.275000 | 0.821078 | 0.009611 |
| 24 | (501559.68, 522458.0] | 13 | 0.307692 | 0.000047 | 4.0 | 9.0 | 0.000068 | 0.000042 | 0.966339 | 0.207692 | 0.624825 | 0.009611 |
| 25 | (522458.0, 543356.32] | 7 | 0.428571 | 0.000026 | 3.0 | 4.0 | 0.000051 | 0.000019 | 1.321159 | 0.120879 | 0.354820 | 0.009611 |
| 26 | (543356.32, 564254.64] | 7 | 0.285714 | 0.000026 | 2.0 | 5.0 | 0.000034 | 0.000023 | 0.902384 | 0.142857 | 0.418775 | 0.009611 |
| 27 | (564254.64, 585152.96] | 3 | 0.000000 | 0.000011 | 0.0 | 3.0 | 0.000000 | 0.000014 | 0.000000 | 0.285714 | 0.902384 | 0.009611 |
| 28 | (585152.96, 606051.28] | 2 | 0.000000 | 0.000007 | 0.0 | 2.0 | 0.000000 | 0.000009 | 0.000000 | 0.000000 | 0.000000 | 0.009611 |
| 29 | (606051.28, 626949.6] | 2 | 0.000000 | 0.000007 | 0.0 | 2.0 | 0.000000 | 0.000009 | 0.000000 | 0.000000 | 0.000000 | 0.009611 |
| 30 | (626949.6, 647847.92] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.009611 |
| 31 | (647847.92, 668746.24] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.009611 |
| 32 | (668746.24, 689644.56] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.009611 |
| 33 | (689644.56, 710542.88] | 2 | 0.000000 | 0.000007 | 0.0 | 2.0 | 0.000000 | 0.000009 | 0.000000 | NaN | NaN | 0.009611 |
| 34 | (710542.88, 731441.2] | 2 | 0.000000 | 0.000007 | 0.0 | 2.0 | 0.000000 | 0.000009 | 0.000000 | 0.000000 | 0.000000 | 0.009611 |
| 35 | (731441.2, 752339.52] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.009611 |
| 36 | (752339.52, 773237.84] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.009611 |
| 37 | (773237.84, 794136.16] | 1 | 0.000000 | 0.000004 | 0.0 | 1.0 | 0.000000 | 0.000005 | 0.000000 | NaN | NaN | 0.009611 |
| 38 | (794136.16, 815034.48] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.009611 |
| 39 | (815034.48, 835932.8] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.009611 |
| 40 | (835932.8, 856831.12] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.009611 |
| 41 | (856831.12, 877729.44] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.009611 |
| 42 | (877729.44, 898627.76] | 2 | 0.000000 | 0.000007 | 0.0 | 2.0 | 0.000000 | 0.000009 | 0.000000 | NaN | NaN | 0.009611 |
| 43 | (898627.76, 919526.08] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.009611 |
| 44 | (919526.08, 940424.4] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.009611 |
| 45 | (940424.4, 961322.72] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.009611 |
| 46 | (961322.72, 982221.04] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.009611 |
| 47 | (982221.04, 1003119.36] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.009611 |
| 48 | (1003119.36, 1024017.68] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.009611 |
| 49 | (1024017.68, 1044916.0] | 1 | 0.000000 | 0.000004 | 0.0 | 1.0 | 0.000000 | 0.000005 | 0.000000 | NaN | NaN | 0.009611 |
plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
# one category will be created for 'total_bal_il' = 0 with a count value = 696066.
# one other category will be created for 'total_bal_il' > 200000.
#***********************************************************************************************
# the categories of everyone with 'total_bal_il' different of 0 and less than 200000.
df_inputs_prepr_temp = df_inputs_prepr.loc[(df_inputs_prepr['total_bal_il'] != 0) & (df_inputs_prepr['total_bal_il'] <= 200000), : ]
#df_inputs_prepr_temp
df_inputs_prepr_temp['total_bal_il_factor'] = pd.cut(df_inputs_prepr_temp['total_bal_il'], 50)
# Here we do fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.
df_temp = woe_ordered_continuous(df_inputs_prepr_temp, 'total_bal_il_factor', df_targets_prepr[df_inputs_prepr_temp.index])
# We calculate weight of evidence.
df_temp
C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1452105147.py:8: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy df_inputs_prepr_temp['total_bal_il_factor'] = pd.cut(df_inputs_prepr_temp['total_bal_il'], 50) C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1025164655.py:4: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning. df = pd.concat([df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].count(), C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1025164655.py:5: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning. df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].mean()], axis = 1)
| total_bal_il_factor | n_obs | prop_good | prop_n_obs | n_good | n_bad | prop_n_good | prop_n_bad | WoE | diff_prop_good | diff_WoE | IV | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | (-198.948, 3999.96] | 5316 | 0.245109 | 0.053300 | 1303.0 | 4013.0 | 0.050998 | 0.054093 | 0.664122 | NaN | NaN | 0.001671 |
| 1 | (3999.96, 7998.92] | 7607 | 0.245958 | 0.076271 | 1871.0 | 5736.0 | 0.073229 | 0.077318 | 0.666347 | 0.000849 | 0.002226 | 0.001671 |
| 2 | (7998.92, 11997.88] | 7994 | 0.260195 | 0.080151 | 2080.0 | 5914.0 | 0.081409 | 0.079717 | 0.703701 | 0.014237 | 0.037353 | 0.001671 |
| 3 | (11997.88, 15996.84] | 8308 | 0.258305 | 0.083299 | 2146.0 | 6162.0 | 0.083992 | 0.083060 | 0.698741 | 0.001890 | 0.004960 | 0.001671 |
| 4 | (15996.84, 19995.8] | 7995 | 0.259162 | 0.080161 | 2072.0 | 5923.0 | 0.081096 | 0.079839 | 0.700989 | 0.000857 | 0.002248 | 0.001671 |
| 5 | (19995.8, 23994.76] | 7196 | 0.259728 | 0.072150 | 1869.0 | 5327.0 | 0.073151 | 0.071805 | 0.702474 | 0.000566 | 0.001485 | 0.001671 |
| 6 | (23994.76, 27993.72] | 6462 | 0.250851 | 0.064790 | 1621.0 | 4841.0 | 0.063444 | 0.065254 | 0.679183 | 0.008876 | 0.023291 | 0.001671 |
| 7 | (27993.72, 31992.68] | 5914 | 0.263274 | 0.059296 | 1557.0 | 4357.0 | 0.060939 | 0.058730 | 0.711782 | 0.012422 | 0.032599 | 0.001671 |
| 8 | (31992.68, 35991.64] | 5061 | 0.265165 | 0.050743 | 1342.0 | 3719.0 | 0.052524 | 0.050130 | 0.716748 | 0.001891 | 0.004966 | 0.001671 |
| 9 | (35991.64, 39990.6] | 4527 | 0.261763 | 0.045389 | 1185.0 | 3342.0 | 0.046380 | 0.045048 | 0.707816 | 0.003402 | 0.008933 | 0.001671 |
| 10 | (39990.6, 43989.56] | 3971 | 0.270964 | 0.039815 | 1076.0 | 2895.0 | 0.042114 | 0.039023 | 0.731982 | 0.009202 | 0.024166 | 0.001671 |
| 11 | (43989.56, 47988.52] | 3394 | 0.266352 | 0.034029 | 904.0 | 2490.0 | 0.035382 | 0.033564 | 0.719866 | 0.004612 | 0.012115 | 0.001671 |
| 12 | (47988.52, 51987.48] | 2894 | 0.252937 | 0.029016 | 732.0 | 2162.0 | 0.028650 | 0.029143 | 0.684655 | 0.013415 | 0.035211 | 0.001671 |
| 13 | (51987.48, 55986.44] | 2548 | 0.263736 | 0.025547 | 672.0 | 1876.0 | 0.026301 | 0.025287 | 0.712997 | 0.010799 | 0.028342 | 0.001671 |
| 14 | (55986.44, 59985.4] | 2207 | 0.241504 | 0.022128 | 533.0 | 1674.0 | 0.020861 | 0.022565 | 0.654668 | 0.022232 | 0.058329 | 0.001671 |
| 15 | (59985.4, 63984.36] | 1995 | 0.260652 | 0.020003 | 520.0 | 1475.0 | 0.020352 | 0.019882 | 0.704899 | 0.019147 | 0.050231 | 0.001671 |
| 16 | (63984.36, 67983.32] | 1719 | 0.253054 | 0.017235 | 435.0 | 1284.0 | 0.017025 | 0.017308 | 0.684962 | 0.007598 | 0.019937 | 0.001671 |
| 17 | (67983.32, 71982.28] | 1582 | 0.256005 | 0.015862 | 405.0 | 1177.0 | 0.015851 | 0.015865 | 0.692705 | 0.002951 | 0.007743 | 0.001671 |
| 18 | (71982.28, 75981.24] | 1384 | 0.256503 | 0.013876 | 355.0 | 1029.0 | 0.013894 | 0.013870 | 0.694011 | 0.000498 | 0.001306 | 0.001671 |
| 19 | (75981.24, 79980.2] | 1206 | 0.252073 | 0.012092 | 304.0 | 902.0 | 0.011898 | 0.012158 | 0.682388 | 0.004430 | 0.011623 | 0.001671 |
| 20 | (79980.2, 83979.16] | 1094 | 0.269653 | 0.010969 | 295.0 | 799.0 | 0.011546 | 0.010770 | 0.728535 | 0.017580 | 0.046147 | 0.001671 |
| 21 | (83979.16, 87978.12] | 972 | 0.215021 | 0.009746 | 209.0 | 763.0 | 0.008180 | 0.010285 | 0.585200 | 0.054632 | 0.143335 | 0.001671 |
| 22 | (87978.12, 91977.08] | 857 | 0.252042 | 0.008593 | 216.0 | 641.0 | 0.008454 | 0.008640 | 0.682307 | 0.037021 | 0.097107 | 0.001671 |
| 23 | (91977.08, 95976.04] | 764 | 0.246073 | 0.007660 | 188.0 | 576.0 | 0.007358 | 0.007764 | 0.666651 | 0.005969 | 0.015656 | 0.001671 |
| 24 | (95976.04, 99975.0] | 661 | 0.210287 | 0.006627 | 139.0 | 522.0 | 0.005440 | 0.007036 | 0.572775 | 0.035786 | 0.093876 | 0.001671 |
| 25 | (99975.0, 103973.96] | 595 | 0.262185 | 0.005966 | 156.0 | 439.0 | 0.006106 | 0.005917 | 0.708924 | 0.051897 | 0.136149 | 0.001671 |
| 26 | (103973.96, 107972.92] | 507 | 0.230769 | 0.005083 | 117.0 | 390.0 | 0.004579 | 0.005257 | 0.626516 | 0.031416 | 0.082408 | 0.001671 |
| 27 | (107972.92, 111971.88] | 462 | 0.259740 | 0.004632 | 120.0 | 342.0 | 0.004697 | 0.004610 | 0.702507 | 0.028971 | 0.075991 | 0.001671 |
| 28 | (111971.88, 115970.84] | 460 | 0.252174 | 0.004612 | 116.0 | 344.0 | 0.004540 | 0.004637 | 0.682653 | 0.007566 | 0.019854 | 0.001671 |
| 29 | (115970.84, 119969.8] | 388 | 0.280928 | 0.003890 | 109.0 | 279.0 | 0.004266 | 0.003761 | 0.758177 | 0.028754 | 0.075524 | 0.001671 |
| 30 | (119969.8, 123968.76] | 363 | 0.250689 | 0.003640 | 91.0 | 272.0 | 0.003562 | 0.003666 | 0.678757 | 0.030239 | 0.079420 | 0.001671 |
| 31 | (123968.76, 127967.72] | 324 | 0.231481 | 0.003249 | 75.0 | 249.0 | 0.002935 | 0.003356 | 0.628384 | 0.019207 | 0.050373 | 0.001671 |
| 32 | (127967.72, 131966.68] | 320 | 0.250000 | 0.003208 | 80.0 | 240.0 | 0.003131 | 0.003235 | 0.676950 | 0.018519 | 0.048566 | 0.001671 |
| 33 | (131966.68, 135965.64] | 293 | 0.242321 | 0.002938 | 71.0 | 222.0 | 0.002779 | 0.002992 | 0.656809 | 0.007679 | 0.020141 | 0.001671 |
| 34 | (135965.64, 139964.6] | 241 | 0.286307 | 0.002416 | 69.0 | 172.0 | 0.002701 | 0.002318 | 0.772336 | 0.043986 | 0.115526 | 0.001671 |
| 35 | (139964.6, 143963.56] | 237 | 0.253165 | 0.002376 | 60.0 | 177.0 | 0.002348 | 0.002386 | 0.685252 | 0.033142 | 0.087084 | 0.001671 |
| 36 | (143963.56, 147962.52] | 217 | 0.253456 | 0.002176 | 55.0 | 162.0 | 0.002153 | 0.002184 | 0.686017 | 0.000292 | 0.000765 | 0.001671 |
| 37 | (147962.52, 151961.48] | 179 | 0.256983 | 0.001795 | 46.0 | 133.0 | 0.001800 | 0.001793 | 0.695271 | 0.003527 | 0.009254 | 0.001671 |
| 38 | (151961.48, 155960.44] | 189 | 0.243386 | 0.001895 | 46.0 | 143.0 | 0.001800 | 0.001928 | 0.659604 | 0.013597 | 0.035668 | 0.001671 |
| 39 | (155960.44, 159959.4] | 184 | 0.217391 | 0.001845 | 40.0 | 144.0 | 0.001566 | 0.001941 | 0.591422 | 0.025995 | 0.068181 | 0.001671 |
| 40 | (159959.4, 163958.36] | 169 | 0.254438 | 0.001694 | 43.0 | 126.0 | 0.001683 | 0.001698 | 0.688593 | 0.037047 | 0.097170 | 0.001671 |
| 41 | (163958.36, 167957.32] | 139 | 0.287770 | 0.001394 | 40.0 | 99.0 | 0.001566 | 0.001334 | 0.776188 | 0.033332 | 0.087595 | 0.001671 |
| 42 | (167957.32, 171956.28] | 128 | 0.195312 | 0.001283 | 25.0 | 103.0 | 0.000978 | 0.001388 | 0.533423 | 0.092457 | 0.242765 | 0.001671 |
| 43 | (171956.28, 175955.24] | 114 | 0.175439 | 0.001143 | 20.0 | 94.0 | 0.000783 | 0.001267 | 0.481059 | 0.019874 | 0.052363 | 0.001671 |
| 44 | (175955.24, 179954.2] | 124 | 0.217742 | 0.001243 | 27.0 | 97.0 | 0.001057 | 0.001308 | 0.592342 | 0.042303 | 0.111283 | 0.001671 |
| 45 | (179954.2, 183953.16] | 107 | 0.280374 | 0.001073 | 30.0 | 77.0 | 0.001174 | 0.001038 | 0.756719 | 0.062632 | 0.164377 | 0.001671 |
| 46 | (183953.16, 187952.12] | 110 | 0.200000 | 0.001103 | 22.0 | 88.0 | 0.000861 | 0.001186 | 0.545749 | 0.080374 | 0.210971 | 0.001671 |
| 47 | (187952.12, 191951.08] | 108 | 0.194444 | 0.001083 | 21.0 | 87.0 | 0.000822 | 0.001173 | 0.531139 | 0.005556 | 0.014609 | 0.001671 |
| 48 | (191951.08, 195950.04] | 77 | 0.311688 | 0.000772 | 24.0 | 53.0 | 0.000939 | 0.000714 | 0.839340 | 0.117244 | 0.308200 | 0.001671 |
| 49 | (195950.04, 199949.0] | 74 | 0.243243 | 0.000742 | 18.0 | 56.0 | 0.000705 | 0.000755 | 0.659229 | 0.068445 | 0.180111 | 0.001671 |
plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
# Categories: '= 0', '0 - 18000', '18000 - 30000', '30000 - 70000', '70000 - 200000', '> 200000'
df_inputs_prepr['total_bal_il:=0'] = np.where((df_inputs_prepr['total_bal_il'] == 0.), 1, 0)
df_inputs_prepr['total_bal_il:0-18k'] = np.where((df_inputs_prepr['total_bal_il'] > 0.) & (df_inputs_prepr['total_bal_il'] <= 18000.), 1, 0)
df_inputs_prepr['total_bal_il:18-30k'] = np.where((df_inputs_prepr['total_bal_il'] > 18000.) & (df_inputs_prepr['total_bal_il'] <= 30000.), 1, 0)
df_inputs_prepr['total_bal_il:30-70k'] = np.where((df_inputs_prepr['total_bal_il'] > 30000.) & (df_inputs_prepr['total_bal_il'] <= 70000.), 1, 0)
df_inputs_prepr['total_bal_il:70-200k'] = np.where((df_inputs_prepr['total_bal_il'] > 70000.) & (df_inputs_prepr['total_bal_il'] <= 200000.), 1, 0)
df_inputs_prepr['total_bal_il:>200k'] = np.where((df_inputs_prepr['total_bal_il'] > 200000.), 1, 0)
df_inputs_prepr = df_inputs_prepr.drop(columns = ['total_bal_il_factor'])
# Drop the provisory feature
Variable: 'max_bal_bc'¶
# unique values
df_inputs_prepr['max_bal_bc'].unique()
array([ 0., 11140., 3179., ..., 26142., 15475., 24378.])
# number of observations with 0 value
df_inputs_prepr['max_bal_bc'].value_counts()[0]
164273
# one category will be created for 'max_bal_bc' = 0 with a count value = 660128.
# one other category will be created for 'max_bal_bc' > 50000.
#********************************
# 'max_bal_bc'
# the categories of everyone with 'total_bal_il' different of 0.
df_inputs_prepr_temp = df_inputs_prepr.loc[(df_inputs_prepr['max_bal_bc'] != 0) & (df_inputs_prepr['max_bal_bc'] <= 50000), : ]
# Here we do fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.
#df_inputs_prepr_temp
df_inputs_prepr_temp['max_bal_bc_factor'] = pd.cut(df_inputs_prepr_temp['max_bal_bc'], 50)
# Here we do fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.
df_temp = woe_ordered_continuous(df_inputs_prepr_temp, 'max_bal_bc_factor', df_targets_prepr[df_inputs_prepr_temp.index])
# We calculate weight of evidence.
df_temp
C:\Users\pc\AppData\Local\Temp\ipykernel_6728\3384841133.py:10: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy df_inputs_prepr_temp['max_bal_bc_factor'] = pd.cut(df_inputs_prepr_temp['max_bal_bc'], 50) C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1025164655.py:4: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning. df = pd.concat([df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].count(), C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1025164655.py:5: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning. df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].mean()], axis = 1)
| max_bal_bc_factor | n_obs | prop_good | prop_n_obs | n_good | n_bad | prop_n_good | prop_n_bad | WoE | diff_prop_good | diff_WoE | IV | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | (-48.997, 1000.94] | 9830 | 0.278535 | 0.089429 | 2738.0 | 7092.0 | 0.097109 | 0.086780 | 0.750959 | NaN | NaN | 0.0056 |
| 1 | (1000.94, 2000.88] | 13143 | 0.278095 | 0.119570 | 3655.0 | 9488.0 | 0.129633 | 0.116098 | 0.749802 | 0.000440 | 0.001157 | 0.0056 |
| 2 | (2000.88, 3000.82] | 14661 | 0.267103 | 0.133380 | 3916.0 | 10745.0 | 0.138890 | 0.131479 | 0.720940 | 0.010992 | 0.028862 | 0.0056 |
| 3 | (3000.82, 4000.76] | 13546 | 0.262734 | 0.123236 | 3559.0 | 9987.0 | 0.126228 | 0.122204 | 0.709478 | 0.004369 | 0.011462 | 0.0056 |
| 4 | (4000.76, 5000.7] | 12530 | 0.259936 | 0.113993 | 3257.0 | 9273.0 | 0.115517 | 0.113467 | 0.702139 | 0.002798 | 0.007339 | 0.0056 |
| 5 | (5000.7, 6000.64] | 9542 | 0.262104 | 0.086809 | 2501.0 | 7041.0 | 0.088704 | 0.086156 | 0.707825 | 0.002168 | 0.005687 | 0.0056 |
| 6 | (6000.64, 7000.58] | 7337 | 0.243287 | 0.066749 | 1785.0 | 5552.0 | 0.063309 | 0.067936 | 0.658501 | 0.018817 | 0.049324 | 0.0056 |
| 7 | (7000.58, 8000.52] | 5866 | 0.247017 | 0.053367 | 1449.0 | 4417.0 | 0.051392 | 0.054048 | 0.668272 | 0.003729 | 0.009772 | 0.0056 |
| 8 | (8000.52, 9000.46] | 4461 | 0.239632 | 0.040584 | 1069.0 | 3392.0 | 0.037915 | 0.041506 | 0.648924 | 0.007384 | 0.019349 | 0.0056 |
| 9 | (9000.46, 10000.4] | 3969 | 0.244646 | 0.036108 | 971.0 | 2998.0 | 0.034439 | 0.036684 | 0.662060 | 0.005014 | 0.013136 | 0.0056 |
| 10 | (10000.4, 11000.34] | 2489 | 0.233427 | 0.022644 | 581.0 | 1908.0 | 0.020606 | 0.023347 | 0.632666 | 0.011219 | 0.029394 | 0.0056 |
| 11 | (11000.34, 12000.28] | 2023 | 0.235788 | 0.018404 | 477.0 | 1546.0 | 0.016918 | 0.018917 | 0.638853 | 0.002361 | 0.006187 | 0.0056 |
| 12 | (12000.28, 13000.22] | 1607 | 0.212819 | 0.014620 | 342.0 | 1265.0 | 0.012130 | 0.015479 | 0.578653 | 0.022970 | 0.060200 | 0.0056 |
| 13 | (13000.22, 14000.16] | 1419 | 0.208598 | 0.012910 | 296.0 | 1123.0 | 0.010498 | 0.013741 | 0.567580 | 0.004221 | 0.011073 | 0.0056 |
| 14 | (14000.16, 15000.1] | 1316 | 0.219605 | 0.011972 | 289.0 | 1027.0 | 0.010250 | 0.012567 | 0.596445 | 0.011007 | 0.028865 | 0.0056 |
| 15 | (15000.1, 16000.04] | 1047 | 0.225406 | 0.009525 | 236.0 | 811.0 | 0.008370 | 0.009924 | 0.611649 | 0.005801 | 0.015204 | 0.0056 |
| 16 | (16000.04, 16999.98] | 789 | 0.212928 | 0.007178 | 168.0 | 621.0 | 0.005959 | 0.007599 | 0.578938 | 0.012478 | 0.032711 | 0.0056 |
| 17 | (16999.98, 17999.92] | 775 | 0.216774 | 0.007051 | 168.0 | 607.0 | 0.005959 | 0.007427 | 0.589024 | 0.003846 | 0.010086 | 0.0056 |
| 18 | (17999.92, 18999.86] | 602 | 0.215947 | 0.005477 | 130.0 | 472.0 | 0.004611 | 0.005776 | 0.586855 | 0.000827 | 0.002169 | 0.0056 |
| 19 | (18999.86, 19999.8] | 591 | 0.204738 | 0.005377 | 121.0 | 470.0 | 0.004292 | 0.005751 | 0.557452 | 0.011209 | 0.029403 | 0.0056 |
| 20 | (19999.8, 20999.74] | 398 | 0.233668 | 0.003621 | 93.0 | 305.0 | 0.003298 | 0.003732 | 0.633298 | 0.028931 | 0.075847 | 0.0056 |
| 21 | (20999.74, 21999.68] | 298 | 0.167785 | 0.002711 | 50.0 | 248.0 | 0.001773 | 0.003035 | 0.460194 | 0.065883 | 0.173105 | 0.0056 |
| 22 | (21999.68, 22999.62] | 269 | 0.189591 | 0.002447 | 51.0 | 218.0 | 0.001809 | 0.002668 | 0.517660 | 0.021806 | 0.057466 | 0.0056 |
| 23 | (22999.62, 23999.56] | 268 | 0.205224 | 0.002438 | 55.0 | 213.0 | 0.001951 | 0.002606 | 0.558728 | 0.015633 | 0.041068 | 0.0056 |
| 24 | (23999.56, 24999.5] | 261 | 0.203065 | 0.002374 | 53.0 | 208.0 | 0.001880 | 0.002545 | 0.553061 | 0.002159 | 0.005666 | 0.0056 |
| 25 | (24999.5, 25999.44] | 147 | 0.210884 | 0.001337 | 31.0 | 116.0 | 0.001099 | 0.001419 | 0.573579 | 0.007819 | 0.020517 | 0.0056 |
| 26 | (25999.44, 26999.38] | 116 | 0.215517 | 0.001055 | 25.0 | 91.0 | 0.000887 | 0.001114 | 0.585728 | 0.004633 | 0.012150 | 0.0056 |
| 27 | (26999.38, 27999.32] | 94 | 0.138298 | 0.000855 | 13.0 | 81.0 | 0.000461 | 0.000991 | 0.381989 | 0.077219 | 0.203739 | 0.0056 |
| 28 | (27999.32, 28999.26] | 70 | 0.171429 | 0.000637 | 12.0 | 58.0 | 0.000426 | 0.000710 | 0.469813 | 0.033131 | 0.087824 | 0.0056 |
| 29 | (28999.26, 29999.2] | 78 | 0.282051 | 0.000710 | 22.0 | 56.0 | 0.000780 | 0.000685 | 0.760202 | 0.110623 | 0.290388 | 0.0056 |
| 30 | (29999.2, 30999.14] | 52 | 0.153846 | 0.000473 | 8.0 | 44.0 | 0.000284 | 0.000538 | 0.423308 | 0.128205 | 0.336893 | 0.0056 |
| 31 | (30999.14, 31999.08] | 37 | 0.216216 | 0.000337 | 8.0 | 29.0 | 0.000284 | 0.000355 | 0.587561 | 0.062370 | 0.164253 | 0.0056 |
| 32 | (31999.08, 32999.02] | 26 | 0.115385 | 0.000237 | 3.0 | 23.0 | 0.000106 | 0.000281 | 0.320683 | 0.100832 | 0.266878 | 0.0056 |
| 33 | (32999.02, 33998.96] | 30 | 0.366667 | 0.000273 | 11.0 | 19.0 | 0.000390 | 0.000232 | 0.985106 | 0.251282 | 0.664423 | 0.0056 |
| 34 | (33998.96, 34998.9] | 49 | 0.244898 | 0.000446 | 12.0 | 37.0 | 0.000426 | 0.000453 | 0.662721 | 0.121769 | 0.322385 | 0.0056 |
| 35 | (34998.9, 35998.84] | 24 | 0.166667 | 0.000218 | 4.0 | 20.0 | 0.000142 | 0.000245 | 0.457239 | 0.078231 | 0.205482 | 0.0056 |
| 36 | (35998.84, 36998.78] | 13 | 0.307692 | 0.000118 | 4.0 | 9.0 | 0.000142 | 0.000110 | 0.827781 | 0.141026 | 0.370542 | 0.0056 |
| 37 | (36998.78, 37998.72] | 27 | 0.222222 | 0.000246 | 6.0 | 21.0 | 0.000213 | 0.000257 | 0.603305 | 0.085470 | 0.224476 | 0.0056 |
| 38 | (37998.72, 38998.66] | 17 | 0.294118 | 0.000155 | 5.0 | 12.0 | 0.000177 | 0.000147 | 0.791960 | 0.071895 | 0.188655 | 0.0056 |
| 39 | (38998.66, 39998.6] | 15 | 0.133333 | 0.000136 | 2.0 | 13.0 | 0.000071 | 0.000159 | 0.368751 | 0.160784 | 0.423209 | 0.0056 |
| 40 | (39998.6, 40998.54] | 13 | 0.153846 | 0.000118 | 2.0 | 11.0 | 0.000071 | 0.000135 | 0.423308 | 0.020513 | 0.054557 | 0.0056 |
| 41 | (40998.54, 41998.48] | 3 | 0.000000 | 0.000027 | 0.0 | 3.0 | 0.000000 | 0.000037 | 0.000000 | 0.153846 | 0.423308 | 0.0056 |
| 42 | (41998.48, 42998.42] | 11 | 0.272727 | 0.000100 | 3.0 | 8.0 | 0.000106 | 0.000098 | 0.735703 | 0.272727 | 0.735703 | 0.0056 |
| 43 | (42998.42, 43998.36] | 9 | 0.000000 | 0.000082 | 0.0 | 9.0 | 0.000000 | 0.000110 | 0.000000 | 0.272727 | 0.735703 | 0.0056 |
| 44 | (43998.36, 44998.3] | 8 | 0.375000 | 0.000073 | 3.0 | 5.0 | 0.000106 | 0.000061 | 1.007636 | 0.375000 | 1.007636 | 0.0056 |
| 45 | (44998.3, 45998.24] | 8 | 0.125000 | 0.000073 | 1.0 | 7.0 | 0.000035 | 0.000086 | 0.346476 | 0.250000 | 0.661160 | 0.0056 |
| 46 | (45998.24, 46998.18] | 6 | 0.333333 | 0.000055 | 2.0 | 4.0 | 0.000071 | 0.000049 | 0.895788 | 0.208333 | 0.549312 | 0.0056 |
| 47 | (46998.18, 47998.12] | 10 | 0.300000 | 0.000091 | 3.0 | 7.0 | 0.000106 | 0.000086 | 0.807469 | 0.033333 | 0.088318 | 0.0056 |
| 48 | (47998.12, 48998.06] | 5 | 0.400000 | 0.000045 | 2.0 | 3.0 | 0.000071 | 0.000037 | 1.075805 | 0.100000 | 0.268336 | 0.0056 |
| 49 | (48998.06, 49998.0] | 14 | 0.214286 | 0.000127 | 3.0 | 11.0 | 0.000106 | 0.000135 | 0.582499 | 0.185714 | 0.493306 | 0.0056 |
plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
# Categories: '= 0', '0 - 8000', '8000 - 16000', '16000 - 26000', '26000 - 50000', '> 50000'
df_inputs_prepr['max_bal_bc:=0'] = np.where((df_inputs_prepr['max_bal_bc'] == 0.), 1, 0)
df_inputs_prepr['max_bal_bc:0-8k'] = np.where((df_inputs_prepr['max_bal_bc'] > 0.) & (df_inputs_prepr['max_bal_bc'] <= 8000.), 1, 0)
df_inputs_prepr['max_bal_bc:8-16k'] = np.where((df_inputs_prepr['max_bal_bc'] > 8000.) & (df_inputs_prepr['max_bal_bc'] <= 16000.), 1, 0)
df_inputs_prepr['max_bal_bc:16-26k'] = np.where((df_inputs_prepr['max_bal_bc'] > 16000.) & (df_inputs_prepr['max_bal_bc'] <= 26000.), 1, 0)
df_inputs_prepr['max_bal_bc:26-50k'] = np.where((df_inputs_prepr['max_bal_bc'] > 26000.) & (df_inputs_prepr['max_bal_bc'] <= 50000.), 1, 0)
df_inputs_prepr['max_bal_bc:>50k'] = np.where((df_inputs_prepr['max_bal_bc'] > 50000.), 1, 0)
Variable: 'avg_cur_bal'¶
# unique values
df_inputs_prepr['avg_cur_bal'].unique()
array([ 4658., 7654., 7645., ..., 58635., 49593., 61961.])
# one other category will be created for 'avg_cur_bal' > 100000.
#********************************
# 'avg_cur_bal'
# the categories of everyone with 'avg_cur_bal' different of 0.
df_inputs_prepr_temp = df_inputs_prepr.loc[df_inputs_prepr['avg_cur_bal'] <= 100000, : ]
# Here we do fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.
#df_inputs_prepr_temp
df_inputs_prepr_temp['avg_cur_bal_factor'] = pd.cut(df_inputs_prepr_temp['avg_cur_bal'], 50)
# Here we do fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.
df_temp = woe_ordered_continuous(df_inputs_prepr_temp, 'avg_cur_bal_factor', df_targets_prepr[df_inputs_prepr_temp.index])
# We calculate weight of evidence.
df_temp
C:\Users\pc\AppData\Local\Temp\ipykernel_6728\712759651.py:9: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy df_inputs_prepr_temp['avg_cur_bal_factor'] = pd.cut(df_inputs_prepr_temp['avg_cur_bal'], 50) C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1025164655.py:4: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning. df = pd.concat([df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].count(), C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1025164655.py:5: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning. df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].mean()], axis = 1)
| avg_cur_bal_factor | n_obs | prop_good | prop_n_obs | n_good | n_bad | prop_n_good | prop_n_bad | WoE | diff_prop_good | diff_WoE | IV | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | (-99.959, 1999.18] | 35692 | 0.246610 | 0.130586 | 8802.0 | 26890.0 | 0.149990 | 0.125281 | 0.787195 | NaN | NaN | 0.022469 |
| 1 | (1999.18, 3998.36] | 50295 | 0.251735 | 0.184014 | 12661.0 | 37634.0 | 0.215749 | 0.175338 | 0.802214 | 0.005125 | 0.015019 | 0.022469 |
| 2 | (3998.36, 5997.54] | 30578 | 0.248970 | 0.111876 | 7613.0 | 22965.0 | 0.129729 | 0.106995 | 0.794114 | 0.002765 | 0.008101 | 0.022469 |
| 3 | (5997.54, 7996.72] | 31977 | 0.203771 | 0.116994 | 6516.0 | 25461.0 | 0.111035 | 0.118624 | 0.660640 | 0.045198 | 0.133473 | 0.022469 |
| 4 | (7996.72, 9995.9] | 13877 | 0.224616 | 0.050772 | 3117.0 | 10760.0 | 0.053115 | 0.050131 | 0.722473 | 0.020845 | 0.061833 | 0.022469 |
| 5 | (9995.9, 11995.08] | 12279 | 0.217037 | 0.044925 | 2665.0 | 9614.0 | 0.045413 | 0.044792 | 0.700053 | 0.007579 | 0.022420 | 0.022469 |
| 6 | (11995.08, 13994.26] | 11237 | 0.203613 | 0.041113 | 2288.0 | 8949.0 | 0.038988 | 0.041694 | 0.660168 | 0.013424 | 0.039885 | 0.022469 |
| 7 | (13994.26, 15993.44] | 10383 | 0.192237 | 0.037988 | 1996.0 | 8387.0 | 0.034013 | 0.039075 | 0.626174 | 0.011376 | 0.033995 | 0.022469 |
| 8 | (15993.44, 17992.62] | 9522 | 0.198173 | 0.034838 | 1887.0 | 7635.0 | 0.032155 | 0.035572 | 0.643934 | 0.005935 | 0.017761 | 0.022469 |
| 9 | (17992.62, 19991.8] | 8585 | 0.181712 | 0.031410 | 1560.0 | 7025.0 | 0.026583 | 0.032730 | 0.594542 | 0.016460 | 0.049393 | 0.022469 |
| 10 | (19991.8, 21990.98] | 7630 | 0.190170 | 0.027916 | 1451.0 | 6179.0 | 0.024726 | 0.028788 | 0.619976 | 0.008458 | 0.025434 | 0.022469 |
| 11 | (21990.98, 23990.16] | 6668 | 0.172016 | 0.024396 | 1147.0 | 5521.0 | 0.019545 | 0.025722 | 0.565231 | 0.018155 | 0.054745 | 0.022469 |
| 12 | (23990.16, 25989.34] | 5819 | 0.177006 | 0.021290 | 1030.0 | 4789.0 | 0.017552 | 0.022312 | 0.580338 | 0.004991 | 0.015107 | 0.022469 |
| 13 | (25989.34, 27988.52] | 5157 | 0.173744 | 0.018868 | 896.0 | 4261.0 | 0.015268 | 0.019852 | 0.570469 | 0.003262 | 0.009869 | 0.022469 |
| 14 | (27988.52, 29987.7] | 4416 | 0.162817 | 0.016157 | 719.0 | 3697.0 | 0.012252 | 0.017224 | 0.537264 | 0.010927 | 0.033205 | 0.022469 |
| 15 | (29987.7, 31986.88] | 3822 | 0.158817 | 0.013984 | 607.0 | 3215.0 | 0.010344 | 0.014979 | 0.525052 | 0.004000 | 0.012213 | 0.022469 |
| 16 | (31986.88, 33986.06] | 3328 | 0.156550 | 0.012176 | 521.0 | 2807.0 | 0.008878 | 0.013078 | 0.518115 | 0.002267 | 0.006937 | 0.022469 |
| 17 | (33986.06, 35985.24] | 2845 | 0.159930 | 0.010409 | 455.0 | 2390.0 | 0.007753 | 0.011135 | 0.528451 | 0.003379 | 0.010336 | 0.022469 |
| 18 | (35985.24, 37984.42] | 2445 | 0.161963 | 0.008946 | 396.0 | 2049.0 | 0.006748 | 0.009546 | 0.534660 | 0.002033 | 0.006209 | 0.022469 |
| 19 | (37984.42, 39983.6] | 2161 | 0.160574 | 0.007906 | 347.0 | 1814.0 | 0.005913 | 0.008451 | 0.530419 | 0.001389 | 0.004241 | 0.022469 |
| 20 | (39983.6, 41982.78] | 1800 | 0.162778 | 0.006586 | 293.0 | 1507.0 | 0.004993 | 0.007021 | 0.537145 | 0.002204 | 0.006726 | 0.022469 |
| 21 | (41982.78, 43981.96] | 1538 | 0.141743 | 0.005627 | 218.0 | 1320.0 | 0.003715 | 0.006150 | 0.472527 | 0.021035 | 0.064618 | 0.022469 |
| 22 | (43981.96, 45981.14] | 1417 | 0.143260 | 0.005184 | 203.0 | 1214.0 | 0.003459 | 0.005656 | 0.477223 | 0.001518 | 0.004696 | 0.022469 |
| 23 | (45981.14, 47980.32] | 1246 | 0.154896 | 0.004559 | 193.0 | 1053.0 | 0.003289 | 0.004906 | 0.513044 | 0.011635 | 0.035822 | 0.022469 |
| 24 | (47980.32, 49979.5] | 1020 | 0.139216 | 0.003732 | 142.0 | 878.0 | 0.002420 | 0.004091 | 0.464697 | 0.015680 | 0.048347 | 0.022469 |
| 25 | (49979.5, 51978.68] | 919 | 0.126224 | 0.003362 | 116.0 | 803.0 | 0.001977 | 0.003741 | 0.424193 | 0.012992 | 0.040504 | 0.022469 |
| 26 | (51978.68, 53977.86] | 781 | 0.134443 | 0.002857 | 105.0 | 676.0 | 0.001789 | 0.003150 | 0.449867 | 0.008219 | 0.025674 | 0.022469 |
| 27 | (53977.86, 55977.04] | 698 | 0.136103 | 0.002554 | 95.0 | 603.0 | 0.001619 | 0.002809 | 0.455032 | 0.001660 | 0.005165 | 0.022469 |
| 28 | (55977.04, 57976.22] | 594 | 0.134680 | 0.002173 | 80.0 | 514.0 | 0.001363 | 0.002395 | 0.450605 | 0.001423 | 0.004427 | 0.022469 |
| 29 | (57976.22, 59975.4] | 542 | 0.134686 | 0.001983 | 73.0 | 469.0 | 0.001244 | 0.002185 | 0.450624 | 0.000006 | 0.000019 | 0.022469 |
| 30 | (59975.4, 61974.58] | 499 | 0.128257 | 0.001826 | 64.0 | 435.0 | 0.001091 | 0.002027 | 0.430558 | 0.006430 | 0.020066 | 0.022469 |
| 31 | (61974.58, 63973.76] | 441 | 0.111111 | 0.001613 | 49.0 | 392.0 | 0.000835 | 0.001826 | 0.376509 | 0.017145 | 0.054049 | 0.022469 |
| 32 | (63973.76, 65972.94] | 367 | 0.128065 | 0.001343 | 47.0 | 320.0 | 0.000801 | 0.001491 | 0.429960 | 0.016954 | 0.053451 | 0.022469 |
| 33 | (65972.94, 67972.12] | 317 | 0.138801 | 0.001160 | 44.0 | 273.0 | 0.000750 | 0.001272 | 0.463412 | 0.010736 | 0.033452 | 0.022469 |
| 34 | (67972.12, 69971.3] | 281 | 0.135231 | 0.001028 | 38.0 | 243.0 | 0.000648 | 0.001132 | 0.452320 | 0.003570 | 0.011092 | 0.022469 |
| 35 | (69971.3, 71970.48] | 249 | 0.136546 | 0.000911 | 34.0 | 215.0 | 0.000579 | 0.001002 | 0.456409 | 0.001315 | 0.004089 | 0.022469 |
| 36 | (71970.48, 73969.66] | 247 | 0.097166 | 0.000904 | 24.0 | 223.0 | 0.000409 | 0.001039 | 0.331914 | 0.039380 | 0.124495 | 0.022469 |
| 37 | (73969.66, 75968.84] | 219 | 0.155251 | 0.000801 | 34.0 | 185.0 | 0.000579 | 0.000862 | 0.514134 | 0.058085 | 0.182220 | 0.022469 |
| 38 | (75968.84, 77968.02] | 190 | 0.110526 | 0.000695 | 21.0 | 169.0 | 0.000358 | 0.000787 | 0.374650 | 0.044725 | 0.139484 | 0.022469 |
| 39 | (77968.02, 79967.2] | 176 | 0.142045 | 0.000644 | 25.0 | 151.0 | 0.000426 | 0.000704 | 0.473465 | 0.031519 | 0.098814 | 0.022469 |
| 40 | (79967.2, 81966.38] | 180 | 0.111111 | 0.000659 | 20.0 | 160.0 | 0.000341 | 0.000745 | 0.376509 | 0.030934 | 0.096956 | 0.022469 |
| 41 | (81966.38, 83965.56] | 143 | 0.090909 | 0.000523 | 13.0 | 130.0 | 0.000222 | 0.000606 | 0.311704 | 0.020202 | 0.064805 | 0.022469 |
| 42 | (83965.56, 85964.74] | 114 | 0.096491 | 0.000417 | 11.0 | 103.0 | 0.000187 | 0.000480 | 0.329741 | 0.005582 | 0.018036 | 0.022469 |
| 43 | (85964.74, 87963.92] | 128 | 0.132812 | 0.000468 | 17.0 | 111.0 | 0.000290 | 0.000517 | 0.444787 | 0.036321 | 0.115047 | 0.022469 |
| 44 | (87963.92, 89963.1] | 109 | 0.110092 | 0.000399 | 12.0 | 97.0 | 0.000204 | 0.000452 | 0.373269 | 0.022721 | 0.071518 | 0.022469 |
| 45 | (89963.1, 91962.28] | 102 | 0.078431 | 0.000373 | 8.0 | 94.0 | 0.000136 | 0.000438 | 0.271001 | 0.031660 | 0.102267 | 0.022469 |
| 46 | (91962.28, 93961.46] | 92 | 0.065217 | 0.000337 | 6.0 | 86.0 | 0.000102 | 0.000401 | 0.227275 | 0.013214 | 0.043727 | 0.022469 |
| 47 | (93961.46, 95960.64] | 66 | 0.151515 | 0.000241 | 10.0 | 56.0 | 0.000170 | 0.000261 | 0.502668 | 0.086298 | 0.275393 | 0.022469 |
| 48 | (95960.64, 97959.82] | 66 | 0.136364 | 0.000241 | 9.0 | 57.0 | 0.000153 | 0.000266 | 0.455842 | 0.015152 | 0.046826 | 0.022469 |
| 49 | (97959.82, 99959.0] | 64 | 0.093750 | 0.000234 | 6.0 | 58.0 | 0.000102 | 0.000270 | 0.320896 | 0.042614 | 0.134946 | 0.022469 |
plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
# Categories: '< 7000', '7000 - 15000', '15000 - 30000', '30000 - 50000', '50000 - 100000', '> 100000'
df_inputs_prepr['avg_cur_bal:0-7k'] = np.where((df_inputs_prepr['avg_cur_bal'] >= 0.) & (df_inputs_prepr['avg_cur_bal'] <= 7000.), 1, 0)
df_inputs_prepr['avg_cur_bal:7-15k'] = np.where((df_inputs_prepr['avg_cur_bal'] > 7000.) & (df_inputs_prepr['avg_cur_bal'] <= 15000.), 1, 0)
df_inputs_prepr['avg_cur_bal:15-30k'] = np.where((df_inputs_prepr['avg_cur_bal'] > 15000.) & (df_inputs_prepr['avg_cur_bal'] <= 30000.), 1, 0)
df_inputs_prepr['avg_cur_bal:30-50k'] = np.where((df_inputs_prepr['avg_cur_bal'] > 30000.) & (df_inputs_prepr['avg_cur_bal'] <= 50000.), 1, 0)
df_inputs_prepr['avg_cur_bal:50-100k'] = np.where((df_inputs_prepr['avg_cur_bal'] > 50000.) & (df_inputs_prepr['avg_cur_bal'] <= 100000.), 1, 0)
df_inputs_prepr['avg_cur_bal:>100k'] = np.where((df_inputs_prepr['avg_cur_bal'] > 100000.), 1, 0)
Variable: 'bc_open_to_buy'¶
# unique values
df_inputs_prepr['bc_open_to_buy'].unique()
array([ 1221., 19625., 207., ..., 37116., 37369., 32213.])
# one other category will be created for 'bc_open_to_buy' > 100000.
#********************************
# 'bc_open_to_buy'
# the categories of everyone with 'bc_open_to_buy' different of 0.
df_inputs_prepr_temp = df_inputs_prepr.loc[df_inputs_prepr['bc_open_to_buy'] <= 100000, : ]
# Here we do fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.
#df_inputs_prepr_temp
df_inputs_prepr_temp['bc_open_to_buy_factor'] = pd.cut(df_inputs_prepr_temp['bc_open_to_buy'], 50)
# Here we do fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.
df_temp = woe_ordered_continuous(df_inputs_prepr_temp, 'bc_open_to_buy_factor', df_targets_prepr[df_inputs_prepr_temp.index])
# We calculate weight of evidence.
df_temp
C:\Users\pc\AppData\Local\Temp\ipykernel_6728\284082808.py:9: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy df_inputs_prepr_temp['bc_open_to_buy_factor'] = pd.cut(df_inputs_prepr_temp['bc_open_to_buy'], 50) C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1025164655.py:4: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning. df = pd.concat([df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].count(), C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1025164655.py:5: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning. df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].mean()], axis = 1)
| bc_open_to_buy_factor | n_obs | prop_good | prop_n_obs | n_good | n_bad | prop_n_good | prop_n_bad | WoE | diff_prop_good | diff_WoE | IV | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | (-99.997, 1999.94] | 79292 | 0.248865 | 0.290178 | 19733.0 | 59559.0 | 0.336115 | 0.277607 | 0.793336 | NaN | NaN | 0.020203 |
| 1 | (1999.94, 3999.88] | 40816 | 0.237505 | 0.149371 | 9694.0 | 31122.0 | 0.165119 | 0.145061 | 0.759999 | 0.011360 | 0.033336 | 0.020203 |
| 2 | (3999.88, 5999.82] | 40135 | 0.210166 | 0.146879 | 8435.0 | 31700.0 | 0.143675 | 0.147755 | 0.679243 | 0.027339 | 0.080757 | 0.020203 |
| 3 | (5999.82, 7999.76] | 19726 | 0.215300 | 0.072190 | 4247.0 | 15479.0 | 0.072340 | 0.072148 | 0.694473 | 0.005134 | 0.015231 | 0.020203 |
| 4 | (7999.76, 9999.7] | 14927 | 0.202184 | 0.054627 | 3018.0 | 11909.0 | 0.051406 | 0.055508 | 0.655495 | 0.013116 | 0.038979 | 0.020203 |
| 5 | (9999.7, 11999.64] | 11763 | 0.206920 | 0.043048 | 2434.0 | 9329.0 | 0.041459 | 0.043483 | 0.669596 | 0.004736 | 0.014101 | 0.020203 |
| 6 | (11999.64, 13999.58] | 9316 | 0.197617 | 0.034093 | 1841.0 | 7475.0 | 0.031358 | 0.034841 | 0.641867 | 0.009303 | 0.027729 | 0.020203 |
| 7 | (13999.58, 15999.52] | 7759 | 0.185591 | 0.028395 | 1440.0 | 6319.0 | 0.024528 | 0.029453 | 0.605829 | 0.012026 | 0.036037 | 0.020203 |
| 8 | (15999.52, 17999.46] | 6360 | 0.183176 | 0.023275 | 1165.0 | 5195.0 | 0.019844 | 0.024214 | 0.598565 | 0.002415 | 0.007264 | 0.020203 |
| 9 | (17999.46, 19999.4] | 5465 | 0.185910 | 0.020000 | 1016.0 | 4449.0 | 0.017306 | 0.020737 | 0.606789 | 0.002734 | 0.008224 | 0.020203 |
| 10 | (19999.4, 21999.34] | 4618 | 0.172152 | 0.016900 | 795.0 | 3823.0 | 0.013541 | 0.017819 | 0.565275 | 0.013758 | 0.041514 | 0.020203 |
| 11 | (21999.34, 23999.28] | 3823 | 0.171593 | 0.013991 | 656.0 | 3167.0 | 0.011174 | 0.014762 | 0.563580 | 0.000559 | 0.001695 | 0.020203 |
| 12 | (23999.28, 25999.22] | 3527 | 0.176354 | 0.012907 | 622.0 | 2905.0 | 0.010595 | 0.013540 | 0.577988 | 0.004761 | 0.014409 | 0.020203 |
| 13 | (25999.22, 27999.16] | 2843 | 0.168484 | 0.010404 | 479.0 | 2364.0 | 0.008159 | 0.011019 | 0.554148 | 0.007870 | 0.023841 | 0.020203 |
| 14 | (27999.16, 29999.1] | 2488 | 0.162379 | 0.009105 | 404.0 | 2084.0 | 0.006881 | 0.009714 | 0.535573 | 0.006105 | 0.018574 | 0.020203 |
| 15 | (29999.1, 31999.04] | 2306 | 0.150043 | 0.008439 | 346.0 | 1960.0 | 0.005893 | 0.009136 | 0.497805 | 0.012336 | 0.037768 | 0.020203 |
| 16 | (31999.04, 33998.98] | 1986 | 0.146022 | 0.007268 | 290.0 | 1696.0 | 0.004940 | 0.007905 | 0.485423 | 0.004021 | 0.012383 | 0.020203 |
| 17 | (33998.98, 35998.92] | 1682 | 0.140309 | 0.006155 | 236.0 | 1446.0 | 0.004020 | 0.006740 | 0.467766 | 0.005713 | 0.017656 | 0.020203 |
| 18 | (35998.92, 37998.86] | 1521 | 0.132150 | 0.005566 | 201.0 | 1320.0 | 0.003424 | 0.006153 | 0.442414 | 0.008159 | 0.025352 | 0.020203 |
| 19 | (37998.86, 39998.8] | 1411 | 0.138909 | 0.005164 | 196.0 | 1215.0 | 0.003339 | 0.005663 | 0.463426 | 0.006759 | 0.021012 | 0.020203 |
| 20 | (39998.8, 41998.74] | 1213 | 0.136026 | 0.004439 | 165.0 | 1048.0 | 0.002810 | 0.004885 | 0.454479 | 0.002882 | 0.008947 | 0.020203 |
| 21 | (41998.74, 43998.68] | 1061 | 0.136664 | 0.003883 | 145.0 | 916.0 | 0.002470 | 0.004270 | 0.456459 | 0.000637 | 0.001980 | 0.020203 |
| 22 | (43998.68, 45998.62] | 948 | 0.165612 | 0.003469 | 157.0 | 791.0 | 0.002674 | 0.003687 | 0.545418 | 0.028948 | 0.088959 | 0.020203 |
| 23 | (45998.62, 47998.56] | 843 | 0.134045 | 0.003085 | 113.0 | 730.0 | 0.001925 | 0.003403 | 0.448317 | 0.031567 | 0.097100 | 0.020203 |
| 24 | (47998.56, 49998.5] | 726 | 0.119835 | 0.002657 | 87.0 | 639.0 | 0.001482 | 0.002978 | 0.403825 | 0.014210 | 0.044492 | 0.020203 |
| 25 | (49998.5, 51998.44] | 702 | 0.136752 | 0.002569 | 96.0 | 606.0 | 0.001635 | 0.002825 | 0.456734 | 0.016917 | 0.052909 | 0.020203 |
| 26 | (51998.44, 53998.38] | 608 | 0.139803 | 0.002225 | 85.0 | 523.0 | 0.001448 | 0.002438 | 0.466197 | 0.003050 | 0.009463 | 0.020203 |
| 27 | (53998.38, 55998.32] | 547 | 0.126143 | 0.002002 | 69.0 | 478.0 | 0.001175 | 0.002228 | 0.423641 | 0.013660 | 0.042557 | 0.020203 |
| 28 | (55998.32, 57998.26] | 521 | 0.099808 | 0.001907 | 52.0 | 469.0 | 0.000886 | 0.002186 | 0.340162 | 0.026335 | 0.083479 | 0.020203 |
| 29 | (57998.26, 59998.2] | 485 | 0.107216 | 0.001775 | 52.0 | 433.0 | 0.000886 | 0.002018 | 0.363852 | 0.007408 | 0.023690 | 0.020203 |
| 30 | (59998.2, 61998.14] | 404 | 0.150990 | 0.001478 | 61.0 | 343.0 | 0.001039 | 0.001599 | 0.500715 | 0.043774 | 0.136864 | 0.020203 |
| 31 | (61998.14, 63998.08] | 367 | 0.100817 | 0.001343 | 37.0 | 330.0 | 0.000630 | 0.001538 | 0.343399 | 0.050173 | 0.157316 | 0.020203 |
| 32 | (63998.08, 65998.02] | 319 | 0.119122 | 0.001167 | 38.0 | 281.0 | 0.000647 | 0.001310 | 0.401580 | 0.018305 | 0.058181 | 0.020203 |
| 33 | (65998.02, 67997.96] | 307 | 0.110749 | 0.001124 | 34.0 | 273.0 | 0.000579 | 0.001272 | 0.375090 | 0.008373 | 0.026491 | 0.020203 |
| 34 | (67997.96, 69997.9] | 256 | 0.117188 | 0.000937 | 30.0 | 226.0 | 0.000511 | 0.001053 | 0.395477 | 0.006438 | 0.020387 | 0.020203 |
| 35 | (69997.9, 71997.84] | 243 | 0.135802 | 0.000889 | 33.0 | 210.0 | 0.000562 | 0.000979 | 0.453783 | 0.018615 | 0.058306 | 0.020203 |
| 36 | (71997.84, 73997.78] | 246 | 0.126016 | 0.000900 | 31.0 | 215.0 | 0.000528 | 0.001002 | 0.423245 | 0.009786 | 0.030539 | 0.020203 |
| 37 | (73997.78, 75997.72] | 194 | 0.092784 | 0.000710 | 18.0 | 176.0 | 0.000307 | 0.000820 | 0.317538 | 0.033233 | 0.105707 | 0.020203 |
| 38 | (75997.72, 77997.66] | 205 | 0.097561 | 0.000750 | 20.0 | 185.0 | 0.000341 | 0.000862 | 0.332942 | 0.004777 | 0.015404 | 0.020203 |
| 39 | (77997.66, 79997.6] | 183 | 0.092896 | 0.000670 | 17.0 | 166.0 | 0.000290 | 0.000774 | 0.317902 | 0.004665 | 0.015040 | 0.020203 |
| 40 | (79997.6, 81997.54] | 161 | 0.136646 | 0.000589 | 22.0 | 139.0 | 0.000375 | 0.000648 | 0.456404 | 0.043750 | 0.138502 | 0.020203 |
| 41 | (81997.54, 83997.48] | 144 | 0.097222 | 0.000527 | 14.0 | 130.0 | 0.000238 | 0.000606 | 0.331852 | 0.039424 | 0.124552 | 0.020203 |
| 42 | (83997.48, 85997.42] | 140 | 0.128571 | 0.000512 | 18.0 | 122.0 | 0.000307 | 0.000569 | 0.431242 | 0.031349 | 0.099390 | 0.020203 |
| 43 | (85997.42, 87997.36] | 100 | 0.060000 | 0.000366 | 6.0 | 94.0 | 0.000102 | 0.000438 | 0.209659 | 0.068571 | 0.221583 | 0.020203 |
| 44 | (87997.36, 89997.3] | 130 | 0.123077 | 0.000476 | 16.0 | 114.0 | 0.000273 | 0.000531 | 0.414024 | 0.063077 | 0.204365 | 0.020203 |
| 45 | (89997.3, 91997.24] | 111 | 0.099099 | 0.000406 | 11.0 | 100.0 | 0.000187 | 0.000466 | 0.337885 | 0.023978 | 0.076138 | 0.020203 |
| 46 | (91997.24, 93997.18] | 95 | 0.105263 | 0.000348 | 10.0 | 85.0 | 0.000170 | 0.000396 | 0.357622 | 0.006164 | 0.019737 | 0.020203 |
| 47 | (93997.18, 95997.12] | 82 | 0.085366 | 0.000300 | 7.0 | 75.0 | 0.000119 | 0.000350 | 0.293471 | 0.019897 | 0.064151 | 0.020203 |
| 48 | (95997.12, 97997.06] | 73 | 0.095890 | 0.000267 | 7.0 | 66.0 | 0.000119 | 0.000308 | 0.327564 | 0.010525 | 0.034093 | 0.020203 |
| 49 | (97997.06, 99997.0] | 75 | 0.133333 | 0.000274 | 10.0 | 65.0 | 0.000170 | 0.000303 | 0.446101 | 0.037443 | 0.118537 | 0.020203 |
plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
# Categories: '< 5000', '5000 - 15000', '15000 - 30000', '30000 - 45000', '45000 - 100000', '> 100000'
df_inputs_prepr['bc_open_to_buy:0-5k'] = np.where((df_inputs_prepr['bc_open_to_buy'] >= 0.) & (df_inputs_prepr['bc_open_to_buy'] <= 5000.), 1, 0)
df_inputs_prepr['bc_open_to_buy:5-15k'] = np.where((df_inputs_prepr['bc_open_to_buy'] > 5000.) & (df_inputs_prepr['bc_open_to_buy'] <= 15000.), 1, 0)
df_inputs_prepr['bc_open_to_buy:15-30k'] = np.where((df_inputs_prepr['bc_open_to_buy'] > 15000.) & (df_inputs_prepr['bc_open_to_buy'] <= 30000.), 1, 0)
df_inputs_prepr['bc_open_to_buy:30-50k'] = np.where((df_inputs_prepr['bc_open_to_buy'] > 30000.) & (df_inputs_prepr['bc_open_to_buy'] <= 45000.), 1, 0)
df_inputs_prepr['bc_open_to_buy:50-100k'] = np.where((df_inputs_prepr['bc_open_to_buy'] > 45000.) & (df_inputs_prepr['bc_open_to_buy'] <= 100000.), 1, 0)
df_inputs_prepr['bc_open_to_buy:>100k'] = np.where((df_inputs_prepr['bc_open_to_buy'] > 100000.), 1, 0)
Variable: 'revol_bal_to_bc_limit'¶
# unique values
df_inputs_prepr['revol_bal_to_bc_limit'].nunique()
238525
# 'revol_bal_to_bc_limit'
df_inputs_prepr['revol_bal_to_bc_limit_factor'] = pd.cut(df_inputs_prepr['revol_bal_to_bc_limit'], 50)
# Here we do fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.
# 'revol_bal_to_bc_limit'
df_temp = woe_ordered_continuous(df_inputs_prepr, 'revol_bal_to_bc_limit_factor', df_targets_prepr)
# We calculate weight of evidence.
df_temp
C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1025164655.py:4: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning. df = pd.concat([df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].count(), C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1025164655.py:5: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning. df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].mean()], axis = 1)
| revol_bal_to_bc_limit_factor | n_obs | prop_good | prop_n_obs | n_good | n_bad | prop_n_good | prop_n_bad | WoE | diff_prop_good | diff_WoE | IV | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | (-0.957, 19.135] | 270872 | 0.214216 | 0.998231 | 58025.0 | 212847.0 | 0.998125 | 0.998260 | 0.693080 | NaN | NaN | 0.00009 |
| 1 | (19.135, 38.269] | 354 | 0.225989 | 0.001305 | 80.0 | 274.0 | 0.001376 | 0.001285 | 0.727964 | 0.011773 | 0.034885 | 0.00009 |
| 2 | (38.269, 57.404] | 65 | 0.292308 | 0.000240 | 19.0 | 46.0 | 0.000327 | 0.000216 | 0.922241 | 0.066319 | 0.194276 | 0.00009 |
| 3 | (57.404, 76.538] | 25 | 0.240000 | 0.000092 | 6.0 | 19.0 | 0.000103 | 0.000089 | 0.769284 | 0.052308 | 0.152957 | 0.00009 |
| 4 | (76.538, 95.673] | 13 | 0.153846 | 0.000048 | 2.0 | 11.0 | 0.000034 | 0.000052 | 0.510938 | 0.086154 | 0.258346 | 0.00009 |
| 5 | (95.673, 114.807] | 8 | 0.125000 | 0.000029 | 1.0 | 7.0 | 0.000017 | 0.000033 | 0.421310 | 0.028846 | 0.089628 | 0.00009 |
| 6 | (114.807, 133.942] | 2 | 0.000000 | 0.000007 | 0.0 | 2.0 | 0.000000 | 0.000009 | 0.000000 | 0.125000 | 0.421310 | 0.00009 |
| 7 | (133.942, 153.076] | 3 | 0.000000 | 0.000011 | 0.0 | 3.0 | 0.000000 | 0.000014 | 0.000000 | 0.000000 | 0.000000 | 0.00009 |
| 8 | (153.076, 172.211] | 2 | 0.500000 | 0.000007 | 1.0 | 1.0 | 0.000017 | 0.000005 | 1.540666 | 0.500000 | 1.540666 | 0.00009 |
| 9 | (172.211, 191.345] | 1 | 0.000000 | 0.000004 | 0.0 | 1.0 | 0.000000 | 0.000005 | 0.000000 | 0.500000 | 1.540666 | 0.00009 |
| 10 | (191.345, 210.48] | 1 | 0.000000 | 0.000004 | 0.0 | 1.0 | 0.000000 | 0.000005 | 0.000000 | 0.000000 | 0.000000 | 0.00009 |
| 11 | (210.48, 229.614] | 1 | 0.000000 | 0.000004 | 0.0 | 1.0 | 0.000000 | 0.000005 | 0.000000 | 0.000000 | 0.000000 | 0.00009 |
| 12 | (229.614, 248.749] | 1 | 0.000000 | 0.000004 | 0.0 | 1.0 | 0.000000 | 0.000005 | 0.000000 | 0.000000 | 0.000000 | 0.00009 |
| 13 | (248.749, 267.883] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.00009 |
| 14 | (267.883, 287.018] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.00009 |
| 15 | (287.018, 306.153] | 1 | 0.000000 | 0.000004 | 0.0 | 1.0 | 0.000000 | 0.000005 | 0.000000 | NaN | NaN | 0.00009 |
| 16 | (306.153, 325.287] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.00009 |
| 17 | (325.287, 344.422] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.00009 |
| 18 | (344.422, 363.556] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.00009 |
| 19 | (363.556, 382.691] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.00009 |
| 20 | (382.691, 401.825] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.00009 |
| 21 | (401.825, 420.96] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.00009 |
| 22 | (420.96, 440.094] | 1 | 0.000000 | 0.000004 | 0.0 | 1.0 | 0.000000 | 0.000005 | 0.000000 | NaN | NaN | 0.00009 |
| 23 | (440.094, 459.229] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.00009 |
| 24 | (459.229, 478.363] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.00009 |
| 25 | (478.363, 497.498] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.00009 |
| 26 | (497.498, 516.632] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.00009 |
| 27 | (516.632, 535.767] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.00009 |
| 28 | (535.767, 554.901] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.00009 |
| 29 | (554.901, 574.036] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.00009 |
| 30 | (574.036, 593.171] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.00009 |
| 31 | (593.171, 612.305] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.00009 |
| 32 | (612.305, 631.44] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.00009 |
| 33 | (631.44, 650.574] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.00009 |
| 34 | (650.574, 669.709] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.00009 |
| 35 | (669.709, 688.843] | 1 | 0.000000 | 0.000004 | 0.0 | 1.0 | 0.000000 | 0.000005 | 0.000000 | NaN | NaN | 0.00009 |
| 36 | (688.843, 707.978] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.00009 |
| 37 | (707.978, 727.112] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.00009 |
| 38 | (727.112, 746.247] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.00009 |
| 39 | (746.247, 765.381] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.00009 |
| 40 | (765.381, 784.516] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.00009 |
| 41 | (784.516, 803.65] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.00009 |
| 42 | (803.65, 822.785] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.00009 |
| 43 | (822.785, 841.919] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.00009 |
| 44 | (841.919, 861.054] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.00009 |
| 45 | (861.054, 880.189] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.00009 |
| 46 | (880.189, 899.323] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.00009 |
| 47 | (899.323, 918.458] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.00009 |
| 48 | (918.458, 937.592] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.00009 |
| 49 | (937.592, 956.727] | 1 | 0.000000 | 0.000004 | 0.0 | 1.0 | 0.000000 | 0.000005 | 0.000000 | NaN | NaN | 0.00009 |
# one other category will be created for 'revol_bal_to_bc_limit' value > 10.
#********************************
# 'revol_bal_to_bc_limit'
# the categories of everyone with 'revol_bal_to_bc_limit' less or equal 10.
df_inputs_prepr_temp = df_inputs_prepr.loc[df_inputs_prepr['revol_bal_to_bc_limit'] <= 10, : ]
# Here we do fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.
#df_inputs_prepr_temp
df_inputs_prepr_temp['revol_bal_to_bc_limit_factor'] = pd.cut(df_inputs_prepr_temp['revol_bal_to_bc_limit'], 50)
# Here we do fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.
df_temp = woe_ordered_continuous(df_inputs_prepr_temp, 'revol_bal_to_bc_limit_factor', df_targets_prepr[df_inputs_prepr_temp.index])
# We calculate weight of evidence.
df_temp
C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1955018008.py:9: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy df_inputs_prepr_temp['revol_bal_to_bc_limit_factor'] = pd.cut(df_inputs_prepr_temp['revol_bal_to_bc_limit'], 50) C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1025164655.py:4: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning. df = pd.concat([df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].count(), C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1025164655.py:5: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning. df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].mean()], axis = 1)
| revol_bal_to_bc_limit_factor | n_obs | prop_good | prop_n_obs | n_good | n_bad | prop_n_good | prop_n_bad | WoE | diff_prop_good | diff_WoE | IV | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | (-0.01, 0.2] | 20300 | 0.158276 | 0.075207 | 3213.0 | 17087.0 | 0.055614 | 0.080543 | 0.525020 | NaN | NaN | 0.0144 |
| 1 | (0.2, 0.4] | 30809 | 0.173910 | 0.114141 | 5358.0 | 25451.0 | 0.092742 | 0.119968 | 0.572706 | 0.015634 | 0.047686 | 0.0144 |
| 2 | (0.4, 0.6] | 40014 | 0.197531 | 0.148243 | 7904.0 | 32110.0 | 0.136811 | 0.151357 | 0.643905 | 0.023621 | 0.071199 | 0.0144 |
| 3 | (0.6, 0.8] | 44351 | 0.212126 | 0.164311 | 9408.0 | 34943.0 | 0.162844 | 0.164710 | 0.687466 | 0.014595 | 0.043561 | 0.0144 |
| 4 | (0.8, 1.0] | 48756 | 0.224342 | 0.180631 | 10938.0 | 37818.0 | 0.189327 | 0.178262 | 0.723711 | 0.012216 | 0.036245 | 0.0144 |
| 5 | (1.0, 1.2] | 28863 | 0.241694 | 0.106931 | 6976.0 | 21887.0 | 0.120748 | 0.103169 | 0.774911 | 0.017352 | 0.051201 | 0.0144 |
| 6 | (1.2, 1.4] | 15821 | 0.246950 | 0.058613 | 3907.0 | 11914.0 | 0.067627 | 0.056159 | 0.790366 | 0.005257 | 0.015455 | 0.0144 |
| 7 | (1.4, 1.6] | 9904 | 0.247274 | 0.036692 | 2449.0 | 7455.0 | 0.042390 | 0.035141 | 0.791317 | 0.000324 | 0.000951 | 0.0144 |
| 8 | (1.6, 1.8] | 6701 | 0.255932 | 0.024826 | 1715.0 | 4986.0 | 0.029685 | 0.023502 | 0.816720 | 0.008658 | 0.025404 | 0.0144 |
| 9 | (1.8, 2.0] | 4674 | 0.241335 | 0.017316 | 1128.0 | 3546.0 | 0.019525 | 0.016715 | 0.773857 | 0.014597 | 0.042864 | 0.0144 |
| 10 | (2.0, 2.2] | 3362 | 0.237359 | 0.012455 | 798.0 | 2564.0 | 0.013813 | 0.012086 | 0.762149 | 0.003976 | 0.011708 | 0.0144 |
| 11 | (2.2, 2.4] | 2546 | 0.260408 | 0.009432 | 663.0 | 1883.0 | 0.011476 | 0.008876 | 0.829833 | 0.023050 | 0.067685 | 0.0144 |
| 12 | (2.4, 2.6] | 1939 | 0.242909 | 0.007184 | 471.0 | 1468.0 | 0.008153 | 0.006920 | 0.778486 | 0.017500 | 0.051347 | 0.0144 |
| 13 | (2.6, 2.8] | 1580 | 0.235443 | 0.005854 | 372.0 | 1208.0 | 0.006439 | 0.005694 | 0.756503 | 0.007466 | 0.021984 | 0.0144 |
| 14 | (2.8, 3.0] | 1329 | 0.253574 | 0.004924 | 337.0 | 992.0 | 0.005833 | 0.004676 | 0.809808 | 0.018131 | 0.053305 | 0.0144 |
| 15 | (3.0, 3.2] | 1068 | 0.230337 | 0.003957 | 246.0 | 822.0 | 0.004258 | 0.003875 | 0.741436 | 0.023237 | 0.068371 | 0.0144 |
| 16 | (3.2, 3.4] | 903 | 0.241417 | 0.003345 | 218.0 | 685.0 | 0.003773 | 0.003229 | 0.774099 | 0.011080 | 0.032663 | 0.0144 |
| 17 | (3.4, 3.6] | 781 | 0.247119 | 0.002893 | 193.0 | 588.0 | 0.003341 | 0.002772 | 0.790862 | 0.005702 | 0.016763 | 0.0144 |
| 18 | (3.6, 3.8] | 670 | 0.232836 | 0.002482 | 156.0 | 514.0 | 0.002700 | 0.002423 | 0.748813 | 0.014283 | 0.042049 | 0.0144 |
| 19 | (3.8, 4.0] | 545 | 0.260550 | 0.002019 | 142.0 | 403.0 | 0.002458 | 0.001900 | 0.830249 | 0.027715 | 0.081436 | 0.0144 |
| 20 | (4.0, 4.2] | 491 | 0.254582 | 0.001819 | 125.0 | 366.0 | 0.002164 | 0.001725 | 0.812765 | 0.005968 | 0.017484 | 0.0144 |
| 21 | (4.2, 4.4] | 442 | 0.212670 | 0.001638 | 94.0 | 348.0 | 0.001627 | 0.001640 | 0.689083 | 0.041913 | 0.123682 | 0.0144 |
| 22 | (4.4, 4.6] | 416 | 0.194712 | 0.001541 | 81.0 | 335.0 | 0.001402 | 0.001579 | 0.635454 | 0.017958 | 0.053628 | 0.0144 |
| 23 | (4.6, 4.8] | 363 | 0.258953 | 0.001345 | 94.0 | 269.0 | 0.001627 | 0.001268 | 0.825572 | 0.064242 | 0.190117 | 0.0144 |
| 24 | (4.8, 5.0] | 285 | 0.266667 | 0.001056 | 76.0 | 209.0 | 0.001315 | 0.000985 | 0.848144 | 0.007713 | 0.022572 | 0.0144 |
| 25 | (5.0, 5.2] | 262 | 0.217557 | 0.000971 | 57.0 | 205.0 | 0.000987 | 0.000966 | 0.703603 | 0.049109 | 0.144540 | 0.0144 |
| 26 | (5.2, 5.4] | 237 | 0.164557 | 0.000878 | 39.0 | 198.0 | 0.000675 | 0.000933 | 0.544236 | 0.053000 | 0.159367 | 0.0144 |
| 27 | (5.4, 5.6] | 239 | 0.271967 | 0.000885 | 65.0 | 174.0 | 0.001125 | 0.000820 | 0.863632 | 0.107410 | 0.319396 | 0.0144 |
| 28 | (5.6, 5.8] | 201 | 0.179104 | 0.000745 | 36.0 | 165.0 | 0.000623 | 0.000778 | 0.588445 | 0.092862 | 0.275188 | 0.0144 |
| 29 | (5.8, 6.0] | 201 | 0.253731 | 0.000745 | 51.0 | 150.0 | 0.000883 | 0.000707 | 0.810269 | 0.074627 | 0.221824 | 0.0144 |
| 30 | (6.0, 6.2] | 163 | 0.251534 | 0.000604 | 41.0 | 122.0 | 0.000710 | 0.000575 | 0.803823 | 0.002198 | 0.006446 | 0.0144 |
| 31 | (6.2, 6.4] | 146 | 0.321918 | 0.000541 | 47.0 | 99.0 | 0.000814 | 0.000467 | 1.009168 | 0.070384 | 0.205345 | 0.0144 |
| 32 | (6.4, 6.599] | 143 | 0.265734 | 0.000530 | 38.0 | 105.0 | 0.000658 | 0.000495 | 0.845417 | 0.056184 | 0.163751 | 0.0144 |
| 33 | (6.599, 6.799] | 146 | 0.273973 | 0.000541 | 40.0 | 106.0 | 0.000692 | 0.000500 | 0.869491 | 0.008238 | 0.024074 | 0.0144 |
| 34 | (6.799, 6.999] | 125 | 0.200000 | 0.000463 | 25.0 | 100.0 | 0.000433 | 0.000471 | 0.651295 | 0.073973 | 0.218196 | 0.0144 |
| 35 | (6.999, 7.199] | 134 | 0.276119 | 0.000496 | 37.0 | 97.0 | 0.000640 | 0.000457 | 0.875759 | 0.076119 | 0.224463 | 0.0144 |
| 36 | (7.199, 7.399] | 120 | 0.200000 | 0.000445 | 24.0 | 96.0 | 0.000415 | 0.000453 | 0.651295 | 0.076119 | 0.224463 | 0.0144 |
| 37 | (7.399, 7.599] | 84 | 0.297619 | 0.000311 | 25.0 | 59.0 | 0.000433 | 0.000278 | 0.938433 | 0.097619 | 0.287137 | 0.0144 |
| 38 | (7.599, 7.799] | 104 | 0.230769 | 0.000385 | 24.0 | 80.0 | 0.000415 | 0.000377 | 0.742713 | 0.066850 | 0.195720 | 0.0144 |
| 39 | (7.799, 7.999] | 106 | 0.216981 | 0.000393 | 23.0 | 83.0 | 0.000398 | 0.000391 | 0.701893 | 0.013788 | 0.040819 | 0.0144 |
| 40 | (7.999, 8.199] | 80 | 0.212500 | 0.000296 | 17.0 | 63.0 | 0.000294 | 0.000297 | 0.688578 | 0.004481 | 0.013315 | 0.0144 |
| 41 | (8.199, 8.399] | 71 | 0.295775 | 0.000263 | 21.0 | 50.0 | 0.000363 | 0.000236 | 0.933061 | 0.083275 | 0.244483 | 0.0144 |
| 42 | (8.399, 8.599] | 63 | 0.190476 | 0.000233 | 12.0 | 51.0 | 0.000208 | 0.000240 | 0.622737 | 0.105298 | 0.310325 | 0.0144 |
| 43 | (8.599, 8.799] | 67 | 0.238806 | 0.000248 | 16.0 | 51.0 | 0.000277 | 0.000240 | 0.766412 | 0.048330 | 0.143675 | 0.0144 |
| 44 | (8.799, 8.999] | 56 | 0.214286 | 0.000207 | 12.0 | 44.0 | 0.000208 | 0.000207 | 0.693887 | 0.024520 | 0.072524 | 0.0144 |
| 45 | (8.999, 9.199] | 72 | 0.194444 | 0.000267 | 14.0 | 58.0 | 0.000242 | 0.000273 | 0.634653 | 0.019841 | 0.059234 | 0.0144 |
| 46 | (9.199, 9.399] | 48 | 0.333333 | 0.000178 | 16.0 | 32.0 | 0.000277 | 0.000151 | 1.042412 | 0.138889 | 0.407758 | 0.0144 |
| 47 | (9.399, 9.599] | 55 | 0.254545 | 0.000204 | 14.0 | 41.0 | 0.000242 | 0.000193 | 0.812656 | 0.078788 | 0.229756 | 0.0144 |
| 48 | (9.599, 9.799] | 43 | 0.209302 | 0.000159 | 9.0 | 34.0 | 0.000156 | 0.000160 | 0.679061 | 0.045243 | 0.133595 | 0.0144 |
| 49 | (9.799, 9.999] | 42 | 0.190476 | 0.000156 | 8.0 | 34.0 | 0.000138 | 0.000160 | 0.622737 | 0.018826 | 0.056324 | 0.0144 |
plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
# Categories: '< 0.6', '0.6 - 1.2', '1.2 - 3.6', '3.6- 5.5', '5.5 - 10.', '> 10.'
df_inputs_prepr['revol_bal_to_bc_limit:0-0.6'] = np.where((df_inputs_prepr['revol_bal_to_bc_limit'] <= 0.6), 1, 0)
df_inputs_prepr['revol_bal_to_bc_limit:0.6-1.2'] = np.where((df_inputs_prepr['revol_bal_to_bc_limit'] > 0.6) & (df_inputs_prepr['revol_bal_to_bc_limit'] <= 1.2), 1, 0)
df_inputs_prepr['revol_bal_to_bc_limit:1.2-3.6'] = np.where((df_inputs_prepr['revol_bal_to_bc_limit'] > 1.2) & (df_inputs_prepr['revol_bal_to_bc_limit'] <= 3.4), 1, 0)
df_inputs_prepr['revol_bal_to_bc_limit:3.6-5.5'] = np.where((df_inputs_prepr['revol_bal_to_bc_limit'] > 3.4) & (df_inputs_prepr['revol_bal_to_bc_limit'] <= 5.5), 1, 0)
df_inputs_prepr['revol_bal_to_bc_limit:5.5-10.'] = np.where((df_inputs_prepr['revol_bal_to_bc_limit'] > 5.5) & (df_inputs_prepr['revol_bal_to_bc_limit'] <= 10.), 1, 0)
df_inputs_prepr['revol_bal_to_bc_limit:>10.'] = np.where((df_inputs_prepr['revol_bal_to_bc_limit'] > 10.), 1, 0)
df_inputs_prepr = df_inputs_prepr.drop(columns = ['revol_bal_to_bc_limit_factor'])
# Drop the provisory feature
Variable: 'revol_bal_to_open_to_buy'¶
# unique values
df_inputs_prepr['revol_bal_to_open_to_buy'].nunique()
260618
# 'revol_bal_to_bc_limit'
df_inputs_prepr['revol_bal_to_open_to_buy_factor'] = pd.cut(df_inputs_prepr['revol_bal_to_open_to_buy'], 50)
# Here we do fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.
# 'revol_bal_to_bc_limit'
df_temp = woe_ordered_continuous(df_inputs_prepr, 'revol_bal_to_open_to_buy_factor', df_targets_prepr)
# We calculate weight of evidence.
df_temp
C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1025164655.py:4: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning. df = pd.concat([df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].count(), C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1025164655.py:5: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning. df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].mean()], axis = 1)
| revol_bal_to_open_to_buy_factor | n_obs | prop_good | prop_n_obs | n_good | n_bad | prop_n_good | prop_n_bad | WoE | diff_prop_good | diff_WoE | IV | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | (-36.021, 720.42] | 269193 | 0.213092 | 0.997410 | 57363.0 | 211830.0 | 0.996681 | 0.997608 | 0.692683 | NaN | NaN | inf |
| 1 | (720.42, 1440.84] | 377 | 0.262599 | 0.001397 | 99.0 | 278.0 | 0.001720 | 0.001309 | 0.838909 | 0.049507 | 0.146226 | inf |
| 2 | (1440.84, 2161.26] | 120 | 0.308333 | 0.000445 | 37.0 | 83.0 | 0.000643 | 0.000391 | 0.972542 | 0.045734 | 0.133633 | inf |
| 3 | (2161.26, 2881.68] | 53 | 0.320755 | 0.000196 | 17.0 | 36.0 | 0.000295 | 0.000170 | 1.008761 | 0.012421 | 0.036219 | inf |
| 4 | (2881.68, 3602.1] | 35 | 0.200000 | 0.000130 | 7.0 | 28.0 | 0.000122 | 0.000132 | 0.653544 | 0.120755 | 0.355217 | inf |
| 5 | (3602.1, 4322.52] | 25 | 0.280000 | 0.000093 | 7.0 | 18.0 | 0.000122 | 0.000085 | 0.889846 | 0.080000 | 0.236302 | inf |
| 6 | (4322.52, 5042.94] | 20 | 0.250000 | 0.000074 | 5.0 | 15.0 | 0.000087 | 0.000071 | 0.801907 | 0.030000 | 0.087939 | inf |
| 7 | (5042.94, 5763.36] | 13 | 0.384615 | 0.000048 | 5.0 | 8.0 | 0.000087 | 0.000038 | 1.195696 | 0.134615 | 0.393788 | inf |
| 8 | (5763.36, 6483.78] | 7 | 0.142857 | 0.000026 | 1.0 | 6.0 | 0.000017 | 0.000028 | 0.479270 | 0.241758 | 0.716426 | inf |
| 9 | (6483.78, 7204.2] | 6 | 0.166667 | 0.000022 | 1.0 | 5.0 | 0.000017 | 0.000024 | 0.552663 | 0.023810 | 0.073393 | inf |
| 10 | (7204.2, 7924.62] | 7 | 0.285714 | 0.000026 | 2.0 | 5.0 | 0.000035 | 0.000024 | 0.906543 | 0.119048 | 0.353880 | inf |
| 11 | (7924.62, 8645.04] | 5 | 0.400000 | 0.000019 | 2.0 | 3.0 | 0.000035 | 0.000014 | 1.241147 | 0.114286 | 0.334605 | inf |
| 12 | (8645.04, 9365.46] | 5 | 0.000000 | 0.000019 | 0.0 | 5.0 | 0.000000 | 0.000024 | 0.000000 | 0.400000 | 1.241147 | inf |
| 13 | (9365.46, 10085.88] | 4 | 0.500000 | 0.000015 | 2.0 | 2.0 | 0.000035 | 0.000009 | 1.545298 | 0.500000 | 1.545298 | inf |
| 14 | (10085.88, 10806.3] | 1 | 0.000000 | 0.000004 | 0.0 | 1.0 | 0.000000 | 0.000005 | 0.000000 | 0.500000 | 1.545298 | inf |
| 15 | (10806.3, 11526.72] | 3 | 0.333333 | 0.000011 | 1.0 | 2.0 | 0.000017 | 0.000009 | 1.045452 | 0.333333 | 1.045452 | inf |
| 16 | (11526.72, 12247.14] | 2 | 0.500000 | 0.000007 | 1.0 | 1.0 | 0.000017 | 0.000005 | 1.545298 | 0.166667 | 0.499846 | inf |
| 17 | (12247.14, 12967.56] | 5 | 0.200000 | 0.000019 | 1.0 | 4.0 | 0.000017 | 0.000019 | 0.653544 | 0.300000 | 0.891754 | inf |
| 18 | (12967.56, 13687.98] | 2 | 0.000000 | 0.000007 | 0.0 | 2.0 | 0.000000 | 0.000009 | 0.000000 | 0.200000 | 0.653544 | inf |
| 19 | (13687.98, 14408.4] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 20 | (14408.4, 15128.82] | 1 | 0.000000 | 0.000004 | 0.0 | 1.0 | 0.000000 | 0.000005 | 0.000000 | NaN | NaN | inf |
| 21 | (15128.82, 15849.24] | 1 | 0.000000 | 0.000004 | 0.0 | 1.0 | 0.000000 | 0.000005 | 0.000000 | 0.000000 | 0.000000 | inf |
| 22 | (15849.24, 16569.66] | 1 | 0.000000 | 0.000004 | 0.0 | 1.0 | 0.000000 | 0.000005 | 0.000000 | 0.000000 | 0.000000 | inf |
| 23 | (16569.66, 17290.08] | 1 | 0.000000 | 0.000004 | 0.0 | 1.0 | 0.000000 | 0.000005 | 0.000000 | 0.000000 | 0.000000 | inf |
| 24 | (17290.08, 18010.5] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 25 | (18010.5, 18730.92] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 26 | (18730.92, 19451.34] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 27 | (19451.34, 20171.76] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 28 | (20171.76, 20892.18] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 29 | (20892.18, 21612.6] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 30 | (21612.6, 22333.02] | 1 | 0.000000 | 0.000004 | 0.0 | 1.0 | 0.000000 | 0.000005 | 0.000000 | NaN | NaN | inf |
| 31 | (22333.02, 23053.44] | 1 | 1.000000 | 0.000004 | 1.0 | 0.0 | 0.000017 | 0.000000 | inf | 1.000000 | inf | inf |
| 32 | (23053.44, 23773.86] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 33 | (23773.86, 24494.28] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 34 | (24494.28, 25214.7] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 35 | (25214.7, 25935.12] | 1 | 1.000000 | 0.000004 | 1.0 | 0.0 | 0.000017 | 0.000000 | inf | NaN | NaN | inf |
| 36 | (25935.12, 26655.54] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 37 | (26655.54, 27375.96] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 38 | (27375.96, 28096.38] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 39 | (28096.38, 28816.8] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 40 | (28816.8, 29537.22] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 41 | (29537.22, 30257.64] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 42 | (30257.64, 30978.06] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 43 | (30978.06, 31698.48] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 44 | (31698.48, 32418.9] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 45 | (32418.9, 33139.32] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 46 | (33139.32, 33859.74] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 47 | (33859.74, 34580.16] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 48 | (34580.16, 35300.58] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | inf |
| 49 | (35300.58, 36021.0] | 2 | 0.500000 | 0.000007 | 1.0 | 1.0 | 0.000017 | 0.000005 | 1.545298 | NaN | NaN | inf |
# one other category will be created for 'revol_bal_to_open_to_buy' value > 100.
#********************************
# 'revol_bal_to_open_to_buy'
# the categories of everyone with 'revol_bal_to_open_to_buy' less or equal 100.
df_inputs_prepr_temp = df_inputs_prepr.loc[df_inputs_prepr['revol_bal_to_open_to_buy'] <= 100, : ]
# Here we do fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.
#df_inputs_prepr_temp
df_inputs_prepr_temp['revol_bal_to_open_to_buy_factor'] = pd.cut(df_inputs_prepr_temp['revol_bal_to_open_to_buy'], 50)
# Here we do fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.
df_temp = woe_ordered_continuous(df_inputs_prepr_temp, 'revol_bal_to_open_to_buy_factor', df_targets_prepr[df_inputs_prepr_temp.index])
# We calculate weight of evidence.
df_temp
C:\Users\pc\AppData\Local\Temp\ipykernel_6728\3664876170.py:9: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy df_inputs_prepr_temp['revol_bal_to_open_to_buy_factor'] = pd.cut(df_inputs_prepr_temp['revol_bal_to_open_to_buy'], 50) C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1025164655.py:4: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning. df = pd.concat([df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].count(), C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1025164655.py:5: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning. df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].mean()], axis = 1)
| revol_bal_to_open_to_buy_factor | n_obs | prop_good | prop_n_obs | n_good | n_bad | prop_n_good | prop_n_bad | WoE | diff_prop_good | diff_WoE | IV | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | (-0.1, 1.999] | 125639 | 0.190275 | 0.477479 | 23906.0 | 101733.0 | 0.428246 | 0.490736 | 0.627361 | NaN | NaN | 0.009295 |
| 1 | (1.999, 3.999] | 47240 | 0.222671 | 0.179531 | 10519.0 | 36721.0 | 0.188435 | 0.177133 | 0.724550 | 0.032396 | 0.097189 | 0.009295 |
| 2 | (3.999, 5.998] | 22821 | 0.229745 | 0.086729 | 5243.0 | 17578.0 | 0.093922 | 0.084792 | 0.745584 | 0.007073 | 0.021034 | 0.009295 |
| 3 | (5.998, 7.998] | 13508 | 0.232381 | 0.051336 | 3139.0 | 10369.0 | 0.056231 | 0.050018 | 0.753409 | 0.002636 | 0.007825 | 0.009295 |
| 4 | (7.998, 9.997] | 8727 | 0.231924 | 0.033166 | 2024.0 | 6703.0 | 0.036257 | 0.032334 | 0.752054 | 0.000457 | 0.001356 | 0.009295 |
| 5 | (9.997, 11.996] | 6368 | 0.235867 | 0.024201 | 1502.0 | 4866.0 | 0.026906 | 0.023472 | 0.763746 | 0.003943 | 0.011692 | 0.009295 |
| 6 | (11.996, 13.996] | 4904 | 0.232667 | 0.018637 | 1141.0 | 3763.0 | 0.020440 | 0.018152 | 0.754259 | 0.003200 | 0.009487 | 0.009295 |
| 7 | (13.996, 15.995] | 3858 | 0.245464 | 0.014662 | 947.0 | 2911.0 | 0.016964 | 0.014042 | 0.792140 | 0.012797 | 0.037880 | 0.009295 |
| 8 | (15.995, 17.995] | 3160 | 0.240190 | 0.012009 | 759.0 | 2401.0 | 0.013597 | 0.011582 | 0.776547 | 0.005274 | 0.015593 | 0.009295 |
| 9 | (17.995, 19.994] | 2683 | 0.242639 | 0.010196 | 651.0 | 2032.0 | 0.011662 | 0.009802 | 0.783790 | 0.002449 | 0.007244 | 0.009295 |
| 10 | (19.994, 21.993] | 2228 | 0.230700 | 0.008467 | 514.0 | 1714.0 | 0.009208 | 0.008268 | 0.748422 | 0.011939 | 0.035369 | 0.009295 |
| 11 | (21.993, 23.993] | 1993 | 0.233818 | 0.007574 | 466.0 | 1527.0 | 0.008348 | 0.007366 | 0.757673 | 0.003118 | 0.009252 | 0.009295 |
| 12 | (23.993, 25.992] | 1719 | 0.248400 | 0.006533 | 427.0 | 1292.0 | 0.007649 | 0.006232 | 0.800810 | 0.014582 | 0.043136 | 0.009295 |
| 13 | (25.992, 27.992] | 1513 | 0.235294 | 0.005750 | 356.0 | 1157.0 | 0.006377 | 0.005581 | 0.762049 | 0.013106 | 0.038761 | 0.009295 |
| 14 | (27.992, 29.991] | 1402 | 0.233951 | 0.005328 | 328.0 | 1074.0 | 0.005876 | 0.005181 | 0.758068 | 0.001343 | 0.003980 | 0.009295 |
| 15 | (29.991, 31.99] | 1218 | 0.270936 | 0.004629 | 330.0 | 888.0 | 0.005912 | 0.004284 | 0.867131 | 0.036984 | 0.109063 | 0.009295 |
| 16 | (31.99, 33.99] | 1142 | 0.243433 | 0.004340 | 278.0 | 864.0 | 0.004980 | 0.004168 | 0.786137 | 0.027503 | 0.080994 | 0.009295 |
| 17 | (33.99, 35.989] | 972 | 0.232510 | 0.003694 | 226.0 | 746.0 | 0.004049 | 0.003599 | 0.753794 | 0.010922 | 0.032343 | 0.009295 |
| 18 | (35.989, 37.989] | 906 | 0.250552 | 0.003443 | 227.0 | 679.0 | 0.004066 | 0.003275 | 0.807158 | 0.018042 | 0.053365 | 0.009295 |
| 19 | (37.989, 39.988] | 848 | 0.247642 | 0.003223 | 210.0 | 638.0 | 0.003762 | 0.003078 | 0.798570 | 0.002910 | 0.008588 | 0.009295 |
| 20 | (39.988, 41.987] | 737 | 0.226594 | 0.002801 | 167.0 | 570.0 | 0.002992 | 0.002750 | 0.736223 | 0.021047 | 0.062347 | 0.009295 |
| 21 | (41.987, 43.987] | 743 | 0.298789 | 0.002824 | 222.0 | 521.0 | 0.003977 | 0.002513 | 0.948719 | 0.072194 | 0.212496 | 0.009295 |
| 22 | (43.987, 45.986] | 623 | 0.243981 | 0.002368 | 152.0 | 471.0 | 0.002723 | 0.002272 | 0.787757 | 0.054808 | 0.160962 | 0.009295 |
| 23 | (45.986, 47.986] | 613 | 0.275693 | 0.002330 | 169.0 | 444.0 | 0.003027 | 0.002142 | 0.881090 | 0.031713 | 0.093333 | 0.009295 |
| 24 | (47.986, 49.985] | 540 | 0.233333 | 0.002052 | 126.0 | 414.0 | 0.002257 | 0.001997 | 0.756235 | 0.042360 | 0.124855 | 0.009295 |
| 25 | (49.985, 51.984] | 520 | 0.232692 | 0.001976 | 121.0 | 399.0 | 0.002168 | 0.001925 | 0.754334 | 0.000641 | 0.001901 | 0.009295 |
| 26 | (51.984, 53.984] | 503 | 0.246521 | 0.001912 | 124.0 | 379.0 | 0.002221 | 0.001828 | 0.795261 | 0.013829 | 0.040928 | 0.009295 |
| 27 | (53.984, 55.983] | 467 | 0.256959 | 0.001775 | 120.0 | 347.0 | 0.002150 | 0.001674 | 0.826042 | 0.010438 | 0.030780 | 0.009295 |
| 28 | (55.983, 57.983] | 439 | 0.277904 | 0.001668 | 122.0 | 317.0 | 0.002185 | 0.001529 | 0.887573 | 0.020945 | 0.061532 | 0.009295 |
| 29 | (57.983, 59.982] | 398 | 0.261307 | 0.001513 | 104.0 | 294.0 | 0.001863 | 0.001418 | 0.838836 | 0.016598 | 0.048738 | 0.009295 |
| 30 | (59.982, 61.981] | 343 | 0.204082 | 0.001304 | 70.0 | 273.0 | 0.001254 | 0.001317 | 0.668966 | 0.057225 | 0.169870 | 0.009295 |
| 31 | (61.981, 63.981] | 344 | 0.220930 | 0.001307 | 76.0 | 268.0 | 0.001361 | 0.001293 | 0.719363 | 0.016849 | 0.050397 | 0.009295 |
| 32 | (63.981, 65.98] | 325 | 0.261538 | 0.001235 | 85.0 | 240.0 | 0.001523 | 0.001158 | 0.839518 | 0.040608 | 0.120155 | 0.009295 |
| 33 | (65.98, 67.98] | 308 | 0.266234 | 0.001171 | 82.0 | 226.0 | 0.001469 | 0.001090 | 0.853321 | 0.004695 | 0.013803 | 0.009295 |
| 34 | (67.98, 69.979] | 303 | 0.277228 | 0.001152 | 84.0 | 219.0 | 0.001505 | 0.001056 | 0.885589 | 0.010994 | 0.032268 | 0.009295 |
| 35 | (69.979, 71.978] | 273 | 0.223443 | 0.001038 | 61.0 | 212.0 | 0.001093 | 0.001023 | 0.726848 | 0.053784 | 0.158742 | 0.009295 |
| 36 | (71.978, 73.978] | 255 | 0.254902 | 0.000969 | 65.0 | 190.0 | 0.001164 | 0.000917 | 0.819982 | 0.031459 | 0.093134 | 0.009295 |
| 37 | (73.978, 75.977] | 251 | 0.243028 | 0.000954 | 61.0 | 190.0 | 0.001093 | 0.000917 | 0.784941 | 0.011874 | 0.035041 | 0.009295 |
| 38 | (75.977, 77.977] | 220 | 0.236364 | 0.000836 | 52.0 | 168.0 | 0.000932 | 0.000810 | 0.765218 | 0.006664 | 0.019723 | 0.009295 |
| 39 | (77.977, 79.976] | 240 | 0.295833 | 0.000912 | 71.0 | 169.0 | 0.001272 | 0.000815 | 0.940074 | 0.059470 | 0.174857 | 0.009295 |
| 40 | (79.976, 81.975] | 219 | 0.228311 | 0.000832 | 50.0 | 169.0 | 0.000896 | 0.000815 | 0.741324 | 0.067523 | 0.198750 | 0.009295 |
| 41 | (81.975, 83.975] | 206 | 0.281553 | 0.000783 | 58.0 | 148.0 | 0.001039 | 0.000714 | 0.898269 | 0.053243 | 0.156945 | 0.009295 |
| 42 | (83.975, 85.974] | 212 | 0.221698 | 0.000806 | 47.0 | 165.0 | 0.000842 | 0.000796 | 0.721651 | 0.059855 | 0.176618 | 0.009295 |
| 43 | (85.974, 87.974] | 217 | 0.317972 | 0.000825 | 69.0 | 148.0 | 0.001236 | 0.000714 | 1.004801 | 0.096274 | 0.283150 | 0.009295 |
| 44 | (87.974, 89.973] | 177 | 0.265537 | 0.000673 | 47.0 | 130.0 | 0.000842 | 0.000627 | 0.851273 | 0.052436 | 0.153528 | 0.009295 |
| 45 | (89.973, 91.972] | 185 | 0.302703 | 0.000703 | 56.0 | 129.0 | 0.001003 | 0.000622 | 0.960165 | 0.037166 | 0.108892 | 0.009295 |
| 46 | (91.972, 93.972] | 154 | 0.266234 | 0.000585 | 41.0 | 113.0 | 0.000734 | 0.000545 | 0.853321 | 0.036469 | 0.106844 | 0.009295 |
| 47 | (93.972, 95.971] | 156 | 0.275641 | 0.000593 | 43.0 | 113.0 | 0.000770 | 0.000545 | 0.880936 | 0.009407 | 0.027615 | 0.009295 |
| 48 | (95.971, 97.971] | 147 | 0.265306 | 0.000559 | 39.0 | 108.0 | 0.000699 | 0.000521 | 0.850595 | 0.010335 | 0.030341 | 0.009295 |
| 49 | (97.971, 99.97] | 163 | 0.282209 | 0.000619 | 46.0 | 117.0 | 0.000824 | 0.000564 | 0.900189 | 0.016902 | 0.049593 | 0.009295 |
plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
# Categories: '< 2', '2 - 4', '4 - 20', '20 - 100', '> 100.'
df_inputs_prepr['revol_bal_to_open_to_buy:0-2'] = np.where((df_inputs_prepr['revol_bal_to_open_to_buy'] <= 2.), 1, 0)
df_inputs_prepr['revol_bal_to_open_to_buy:2-4'] = np.where((df_inputs_prepr['revol_bal_to_open_to_buy'] > 2.) & (df_inputs_prepr['revol_bal_to_open_to_buy'] <= 4.), 1, 0)
df_inputs_prepr['revol_bal_to_open_to_buy:4-20'] = np.where((df_inputs_prepr['revol_bal_to_open_to_buy'] > 4.) & (df_inputs_prepr['revol_bal_to_open_to_buy'] <= 20.), 1, 0)
df_inputs_prepr['revol_bal_to_open_to_buy:20-100'] = np.where((df_inputs_prepr['revol_bal_to_open_to_buy'] > 20.) & (df_inputs_prepr['revol_bal_to_open_to_buy'] <= 100.), 1, 0)
df_inputs_prepr['revol_bal_to_open_to_buy:>100'] = np.where((df_inputs_prepr['revol_bal_to_open_to_buy'] > 100.), 1, 0)
df_inputs_prepr = df_inputs_prepr.drop(columns = ['revol_bal_to_open_to_buy_factor'])
# Drop the provisory feature
Variable: 'total_bal_ex_mort_to_inc'¶
# unique values
df_inputs_prepr['total_bal_ex_mort_to_inc'].max()
102819.0
# one other category will be created for 'total_bal_ex_mort_to_inc' value > 100.
#********************************
# 'total_bal_ex_mort_to_inc'
# the categories of everyone with 'total_bal_ex_mort_to_inc' less or equal 100.
df_inputs_prepr_temp = df_inputs_prepr.loc[df_inputs_prepr['total_bal_ex_mort_to_inc'] <= 10, : ]
# Here we do fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.
#df_inputs_prepr_temp
df_inputs_prepr_temp['total_bal_ex_mort_to_inc_factor'] = pd.cut(df_inputs_prepr_temp['total_bal_ex_mort_to_inc'], 50)
# Here we do fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.
df_temp = woe_ordered_continuous(df_inputs_prepr_temp, 'total_bal_ex_mort_to_inc_factor', df_targets_prepr[df_inputs_prepr_temp.index])
# We calculate weight of evidence.
df_temp
C:\Users\pc\AppData\Local\Temp\ipykernel_6728\3588619369.py:9: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy df_inputs_prepr_temp['total_bal_ex_mort_to_inc_factor'] = pd.cut(df_inputs_prepr_temp['total_bal_ex_mort_to_inc'], 50) C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1025164655.py:4: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning. df = pd.concat([df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].count(), C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1025164655.py:5: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning. df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].mean()], axis = 1)
| total_bal_ex_mort_to_inc_factor | n_obs | prop_good | prop_n_obs | n_good | n_bad | prop_n_good | prop_n_bad | WoE | diff_prop_good | diff_WoE | IV | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | (-0.00994, 0.199] | 27084 | 0.178260 | 0.098807 | 4828.0 | 22256.0 | 0.082145 | 0.103355 | 0.584885 | NaN | NaN | 0.009508 |
| 1 | (0.199, 0.398] | 56628 | 0.188052 | 0.206589 | 10649.0 | 45979.0 | 0.181186 | 0.213523 | 0.614403 | 0.009792 | 2.951776e-02 | 0.009508 |
| 2 | (0.398, 0.597] | 60626 | 0.209201 | 0.221175 | 12683.0 | 47943.0 | 0.215793 | 0.222644 | 0.677642 | 0.021149 | 6.323896e-02 | 0.009508 |
| 3 | (0.597, 0.795] | 47141 | 0.232791 | 0.171979 | 10974.0 | 36167.0 | 0.186715 | 0.167957 | 0.747486 | 0.023590 | 6.984450e-02 | 0.009508 |
| 4 | (0.795, 0.994] | 30836 | 0.239558 | 0.112495 | 7387.0 | 23449.0 | 0.125685 | 0.108895 | 0.767410 | 0.006767 | 1.992332e-02 | 0.009508 |
| 5 | (0.994, 1.193] | 17928 | 0.239904 | 0.065405 | 4301.0 | 13627.0 | 0.073179 | 0.063283 | 0.768428 | 0.000346 | 1.018757e-03 | 0.009508 |
| 6 | (1.193, 1.392] | 10527 | 0.232165 | 0.038404 | 2444.0 | 8083.0 | 0.041583 | 0.037537 | 0.745641 | 0.007739 | 2.278773e-02 | 0.009508 |
| 7 | (1.392, 1.591] | 6491 | 0.233862 | 0.023680 | 1518.0 | 4973.0 | 0.025828 | 0.023094 | 0.750643 | 0.001697 | 5.002816e-03 | 0.009508 |
| 8 | (1.591, 1.79] | 4210 | 0.224703 | 0.015359 | 946.0 | 3264.0 | 0.016096 | 0.015158 | 0.723612 | 0.009159 | 2.703133e-02 | 0.009508 |
| 9 | (1.79, 1.988] | 2990 | 0.233779 | 0.010908 | 699.0 | 2291.0 | 0.011893 | 0.010639 | 0.750399 | 0.009076 | 2.678674e-02 | 0.009508 |
| 10 | (1.988, 2.187] | 2190 | 0.233333 | 0.007990 | 511.0 | 1679.0 | 0.008694 | 0.007797 | 0.749085 | 0.000446 | 1.314098e-03 | 0.009508 |
| 11 | (2.187, 2.386] | 1577 | 0.239062 | 0.005753 | 377.0 | 1200.0 | 0.006414 | 0.005573 | 0.765950 | 0.005728 | 1.686548e-02 | 0.009508 |
| 12 | (2.386, 2.585] | 1217 | 0.221857 | 0.004440 | 270.0 | 947.0 | 0.004594 | 0.004398 | 0.715194 | 0.017204 | 5.075619e-02 | 0.009508 |
| 13 | (2.585, 2.784] | 973 | 0.256937 | 0.003550 | 250.0 | 723.0 | 0.004254 | 0.003358 | 0.818399 | 0.035080 | 1.032047e-01 | 0.009508 |
| 14 | (2.784, 2.983] | 686 | 0.240525 | 0.002503 | 165.0 | 521.0 | 0.002807 | 0.002419 | 0.770254 | 0.016413 | 4.814512e-02 | 0.009508 |
| 15 | (2.983, 3.181] | 572 | 0.260490 | 0.002087 | 149.0 | 423.0 | 0.002535 | 0.001964 | 0.828793 | 0.019965 | 5.853888e-02 | 0.009508 |
| 16 | (3.181, 3.38] | 464 | 0.284483 | 0.001693 | 132.0 | 332.0 | 0.002246 | 0.001542 | 0.898812 | 0.023993 | 7.001976e-02 | 0.009508 |
| 17 | (3.38, 3.579] | 317 | 0.261830 | 0.001156 | 83.0 | 234.0 | 0.001412 | 0.001087 | 0.832712 | 0.022653 | 6.610065e-02 | 0.009508 |
| 18 | (3.579, 3.778] | 262 | 0.263359 | 0.000956 | 69.0 | 193.0 | 0.001174 | 0.000896 | 0.837182 | 0.001529 | 4.470404e-03 | 0.009508 |
| 19 | (3.778, 3.977] | 221 | 0.294118 | 0.000806 | 65.0 | 156.0 | 0.001106 | 0.000724 | 0.926865 | 0.030759 | 8.968256e-02 | 0.009508 |
| 20 | (3.977, 4.176] | 177 | 0.248588 | 0.000646 | 44.0 | 133.0 | 0.000749 | 0.000618 | 0.793932 | 0.045530 | 1.329325e-01 | 0.009508 |
| 21 | (4.176, 4.375] | 135 | 0.274074 | 0.000493 | 37.0 | 98.0 | 0.000630 | 0.000455 | 0.868471 | 0.025487 | 7.453876e-02 | 0.009508 |
| 22 | (4.375, 4.573] | 127 | 0.228346 | 0.000463 | 29.0 | 98.0 | 0.000493 | 0.000455 | 0.734375 | 0.045728 | 1.340955e-01 | 0.009508 |
| 23 | (4.573, 4.772] | 104 | 0.269231 | 0.000379 | 28.0 | 76.0 | 0.000476 | 0.000353 | 0.854336 | 0.040884 | 1.199606e-01 | 0.009508 |
| 24 | (4.772, 4.971] | 95 | 0.084211 | 0.000347 | 8.0 | 87.0 | 0.000136 | 0.000404 | 0.290353 | 0.185020 | 5.639830e-01 | 0.009508 |
| 25 | (4.971, 5.17] | 71 | 0.211268 | 0.000259 | 15.0 | 56.0 | 0.000255 | 0.000260 | 0.683788 | 0.127057 | 3.934354e-01 | 0.009508 |
| 26 | (5.17, 5.369] | 48 | 0.250000 | 0.000175 | 12.0 | 36.0 | 0.000204 | 0.000167 | 0.798075 | 0.038732 | 1.142863e-01 | 0.009508 |
| 27 | (5.369, 5.568] | 46 | 0.239130 | 0.000168 | 11.0 | 35.0 | 0.000187 | 0.000163 | 0.766153 | 0.010870 | 3.192155e-02 | 0.009508 |
| 28 | (5.568, 5.766] | 50 | 0.260000 | 0.000182 | 13.0 | 37.0 | 0.000221 | 0.000172 | 0.827361 | 0.020870 | 6.120768e-02 | 0.009508 |
| 29 | (5.766, 5.965] | 44 | 0.250000 | 0.000161 | 11.0 | 33.0 | 0.000187 | 0.000153 | 0.798075 | 0.010000 | 2.928614e-02 | 0.009508 |
| 30 | (5.965, 6.164] | 33 | 0.303030 | 0.000120 | 10.0 | 23.0 | 0.000170 | 0.000107 | 0.952795 | 0.053030 | 1.547208e-01 | 0.009508 |
| 31 | (6.164, 6.363] | 31 | 0.322581 | 0.000113 | 10.0 | 21.0 | 0.000170 | 0.000098 | 1.009656 | 0.019550 | 5.686078e-02 | 0.009508 |
| 32 | (6.363, 6.562] | 33 | 0.303030 | 0.000120 | 10.0 | 23.0 | 0.000170 | 0.000107 | 0.952795 | 0.019550 | 5.686078e-02 | 0.009508 |
| 33 | (6.562, 6.761] | 21 | 0.190476 | 0.000077 | 4.0 | 17.0 | 0.000068 | 0.000079 | 0.621687 | 0.112554 | 3.311088e-01 | 0.009508 |
| 34 | (6.761, 6.959] | 22 | 0.409091 | 0.000080 | 9.0 | 13.0 | 0.000153 | 0.000060 | 1.263127 | 0.218615 | 6.414405e-01 | 0.009508 |
| 35 | (6.959, 7.158] | 14 | 0.214286 | 0.000051 | 3.0 | 11.0 | 0.000051 | 0.000051 | 0.692753 | 0.194805 | 5.703736e-01 | 0.009508 |
| 36 | (7.158, 7.357] | 14 | 0.214286 | 0.000051 | 3.0 | 11.0 | 0.000051 | 0.000051 | 0.692753 | 0.000000 | 0.000000e+00 | 0.009508 |
| 37 | (7.357, 7.556] | 17 | 0.058824 | 0.000062 | 1.0 | 16.0 | 0.000017 | 0.000074 | 0.206190 | 0.155462 | 4.865638e-01 | 0.009508 |
| 38 | (7.556, 7.755] | 10 | 0.100000 | 0.000036 | 1.0 | 9.0 | 0.000017 | 0.000042 | 0.341521 | 0.041176 | 1.353317e-01 | 0.009508 |
| 39 | (7.755, 7.954] | 16 | 0.187500 | 0.000058 | 3.0 | 13.0 | 0.000051 | 0.000060 | 0.612744 | 0.087500 | 2.712222e-01 | 0.009508 |
| 40 | (7.954, 8.153] | 3 | 0.333333 | 0.000011 | 1.0 | 2.0 | 0.000017 | 0.000009 | 1.040944 | 0.145833 | 4.282008e-01 | 0.009508 |
| 41 | (8.153, 8.351] | 9 | 0.333333 | 0.000033 | 3.0 | 6.0 | 0.000051 | 0.000028 | 1.040944 | 0.000000 | 2.220446e-16 | 0.009508 |
| 42 | (8.351, 8.55] | 10 | 0.100000 | 0.000036 | 1.0 | 9.0 | 0.000017 | 0.000042 | 0.341521 | 0.233333 | 6.994230e-01 | 0.009508 |
| 43 | (8.55, 8.749] | 6 | 0.333333 | 0.000022 | 2.0 | 4.0 | 0.000034 | 0.000019 | 1.040944 | 0.233333 | 6.994230e-01 | 0.009508 |
| 44 | (8.749, 8.948] | 4 | 0.250000 | 0.000015 | 1.0 | 3.0 | 0.000017 | 0.000014 | 0.798075 | 0.083333 | 2.428697e-01 | 0.009508 |
| 45 | (8.948, 9.147] | 8 | 0.000000 | 0.000029 | 0.0 | 8.0 | 0.000000 | 0.000037 | 0.000000 | 0.250000 | 7.980746e-01 | 0.009508 |
| 46 | (9.147, 9.346] | 8 | 0.375000 | 0.000029 | 3.0 | 5.0 | 0.000051 | 0.000023 | 1.162609 | 0.375000 | 1.162609e+00 | 0.009508 |
| 47 | (9.346, 9.544] | 6 | 0.166667 | 0.000022 | 1.0 | 5.0 | 0.000017 | 0.000023 | 0.549713 | 0.208333 | 6.128962e-01 | 0.009508 |
| 48 | (9.544, 9.743] | 4 | 0.000000 | 0.000015 | 0.0 | 4.0 | 0.000000 | 0.000019 | 0.000000 | 0.166667 | 5.497132e-01 | 0.009508 |
| 49 | (9.743, 9.942] | 3 | 0.000000 | 0.000011 | 0.0 | 3.0 | 0.000000 | 0.000014 | 0.000000 | 0.000000 | 0.000000e+00 | 0.009508 |
plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
# Categories: '< 0.4', '0.4 - 1', '1 - 2.6', '2.6 - 4.4', '> 4.4'
df_inputs_prepr['total_bal_ex_mort_to_inc:0-0.4'] = np.where((df_inputs_prepr['total_bal_ex_mort_to_inc'] <= 0.4), 1, 0)
df_inputs_prepr['total_bal_ex_mort_to_inc:0.4-1'] = np.where((df_inputs_prepr['total_bal_ex_mort_to_inc'] > 0.4) & (df_inputs_prepr['total_bal_ex_mort_to_inc'] <= 1.), 1, 0)
df_inputs_prepr['total_bal_ex_mort_to_inc:1-2.6'] = np.where((df_inputs_prepr['total_bal_ex_mort_to_inc'] > 1.) & (df_inputs_prepr['total_bal_ex_mort_to_inc'] <= 2.6), 1, 0)
df_inputs_prepr['total_bal_ex_mort_to_inc:2.6-4.4'] = np.where((df_inputs_prepr['total_bal_ex_mort_to_inc'] > 2.6) & (df_inputs_prepr['total_bal_ex_mort_to_inc'] <= 4.4), 1, 0)
df_inputs_prepr['total_bal_ex_mort_to_inc:>4.4'] = np.where((df_inputs_prepr['total_bal_ex_mort_to_inc'] > 4.4), 1, 0)
Variable: 'total_balance_to_credit_ratio'¶
# unique values
df_inputs_prepr['total_balance_to_credit_ratio'].nunique()
259985
# 'total_balance_to_credit_ratio'
# the categories of everyone with 'total_balance_to_credit_ratio' less or equal 2.
df_inputs_prepr_temp = df_inputs_prepr.loc[df_inputs_prepr['total_balance_to_credit_ratio'] <= 2., : ]
# Here we do fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.
#df_inputs_prepr_temp
df_inputs_prepr_temp['total_balance_to_credit_ratio_factor'] = pd.cut(df_inputs_prepr_temp['total_balance_to_credit_ratio'], 40)
# Here we do fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.
df_temp = woe_ordered_continuous(df_inputs_prepr_temp, 'total_balance_to_credit_ratio_factor', df_targets_prepr[df_inputs_prepr_temp.index])
# We calculate weight of evidence.
df_temp
C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1020104122.py:7: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy df_inputs_prepr_temp['total_balance_to_credit_ratio_factor'] = pd.cut(df_inputs_prepr_temp['total_balance_to_credit_ratio'], 40) C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1025164655.py:4: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning. df = pd.concat([df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].count(), C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1025164655.py:5: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning. df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].mean()], axis = 1)
| total_balance_to_credit_ratio_factor | n_obs | prop_good | prop_n_obs | n_good | n_bad | prop_n_good | prop_n_bad | WoE | diff_prop_good | diff_WoE | IV | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | (-0.00197, 0.0491] | 629 | 0.243243 | 0.002294 | 153.0 | 476.0 | 0.002602 | 0.002210 | 0.778251 | NaN | NaN | inf |
| 1 | (0.0491, 0.0983] | 725 | 0.177931 | 0.002644 | 129.0 | 596.0 | 0.002194 | 0.002767 | 0.583896 | 0.065312 | 0.194355 | inf |
| 2 | (0.0983, 0.147] | 1004 | 0.184263 | 0.003661 | 185.0 | 819.0 | 0.003147 | 0.003802 | 0.603007 | 0.006332 | 0.019111 | inf |
| 3 | (0.147, 0.197] | 1623 | 0.175601 | 0.005919 | 285.0 | 1338.0 | 0.004847 | 0.006211 | 0.576845 | 0.008662 | 0.026161 | inf |
| 4 | (0.197, 0.246] | 2306 | 0.185169 | 0.008410 | 427.0 | 1879.0 | 0.007263 | 0.008723 | 0.605736 | 0.009568 | 0.028891 | inf |
| 5 | (0.246, 0.295] | 3343 | 0.178881 | 0.012191 | 598.0 | 2745.0 | 0.010171 | 0.012743 | 0.586768 | 0.006288 | 0.018968 | inf |
| 6 | (0.295, 0.344] | 4461 | 0.194351 | 0.016269 | 867.0 | 3594.0 | 0.014746 | 0.016684 | 0.633315 | 0.015470 | 0.046547 | inf |
| 7 | (0.344, 0.393] | 5882 | 0.202992 | 0.021451 | 1194.0 | 4688.0 | 0.020308 | 0.021763 | 0.659152 | 0.008641 | 0.025836 | inf |
| 8 | (0.393, 0.442] | 7282 | 0.204065 | 0.026556 | 1486.0 | 5796.0 | 0.025274 | 0.026906 | 0.662351 | 0.001073 | 0.003200 | inf |
| 9 | (0.442, 0.491] | 9184 | 0.222561 | 0.033493 | 2044.0 | 7140.0 | 0.034765 | 0.033145 | 0.717284 | 0.018496 | 0.054933 | inf |
| 10 | (0.491, 0.54] | 11099 | 0.227408 | 0.040476 | 2524.0 | 8575.0 | 0.042929 | 0.039807 | 0.731611 | 0.004847 | 0.014327 | inf |
| 11 | (0.54, 0.59] | 13177 | 0.232754 | 0.048054 | 3067.0 | 10110.0 | 0.052164 | 0.046933 | 0.747385 | 0.005346 | 0.015774 | inf |
| 12 | (0.59, 0.639] | 15165 | 0.231916 | 0.055304 | 3517.0 | 11648.0 | 0.059818 | 0.054072 | 0.744913 | 0.000838 | 0.002472 | inf |
| 13 | (0.639, 0.688] | 17992 | 0.240218 | 0.065614 | 4322.0 | 13670.0 | 0.073510 | 0.063459 | 0.769359 | 0.008302 | 0.024446 | inf |
| 14 | (0.688, 0.737] | 33716 | 0.200172 | 0.122957 | 6749.0 | 26967.0 | 0.114789 | 0.125186 | 0.650732 | 0.040046 | 0.118627 | inf |
| 15 | (0.737, 0.786] | 24314 | 0.225179 | 0.088669 | 5475.0 | 18839.0 | 0.093120 | 0.087454 | 0.725026 | 0.025007 | 0.074294 | inf |
| 16 | (0.786, 0.835] | 29154 | 0.208376 | 0.106320 | 6075.0 | 23079.0 | 0.103325 | 0.107137 | 0.675195 | 0.016803 | 0.049830 | inf |
| 17 | (0.835, 0.884] | 34777 | 0.205682 | 0.126826 | 7153.0 | 27624.0 | 0.121660 | 0.128236 | 0.667172 | 0.002694 | 0.008024 | inf |
| 18 | (0.884, 0.934] | 34197 | 0.208235 | 0.124711 | 7121.0 | 27076.0 | 0.121116 | 0.125692 | 0.674774 | 0.002553 | 0.007602 | inf |
| 19 | (0.934, 0.983] | 18221 | 0.214697 | 0.066449 | 3912.0 | 14309.0 | 0.066536 | 0.066425 | 0.693982 | 0.006463 | 0.019208 | inf |
| 20 | (0.983, 1.032] | 3094 | 0.250808 | 0.011283 | 776.0 | 2318.0 | 0.013198 | 0.010761 | 0.800452 | 0.036111 | 0.106469 | inf |
| 21 | (1.032, 1.081] | 1215 | 0.262551 | 0.004431 | 319.0 | 896.0 | 0.005426 | 0.004159 | 0.834830 | 0.011743 | 0.034379 | inf |
| 22 | (1.081, 1.13] | 671 | 0.256334 | 0.002447 | 172.0 | 499.0 | 0.002925 | 0.002316 | 0.816640 | 0.006218 | 0.018190 | inf |
| 23 | (1.13, 1.179] | 391 | 0.232737 | 0.001426 | 91.0 | 300.0 | 0.001548 | 0.001393 | 0.747333 | 0.023597 | 0.069307 | inf |
| 24 | (1.179, 1.228] | 200 | 0.255000 | 0.000729 | 51.0 | 149.0 | 0.000867 | 0.000692 | 0.812734 | 0.022263 | 0.065401 | inf |
| 25 | (1.228, 1.277] | 158 | 0.253165 | 0.000576 | 40.0 | 118.0 | 0.000680 | 0.000548 | 0.807358 | 0.001835 | 0.005376 | inf |
| 26 | (1.277, 1.327] | 67 | 0.238806 | 0.000244 | 16.0 | 51.0 | 0.000272 | 0.000237 | 0.765206 | 0.014359 | 0.042152 | inf |
| 27 | (1.327, 1.376] | 48 | 0.291667 | 0.000175 | 14.0 | 34.0 | 0.000238 | 0.000158 | 0.919739 | 0.052861 | 0.154533 | inf |
| 28 | (1.376, 1.425] | 27 | 0.333333 | 0.000098 | 9.0 | 18.0 | 0.000153 | 0.000084 | 1.040954 | 0.041667 | 0.121214 | inf |
| 29 | (1.425, 1.474] | 29 | 0.310345 | 0.000106 | 9.0 | 20.0 | 0.000153 | 0.000093 | 0.974078 | 0.022989 | 0.066875 | inf |
| 30 | (1.474, 1.523] | 24 | 0.208333 | 0.000088 | 5.0 | 19.0 | 0.000085 | 0.000088 | 0.675068 | 0.102011 | 0.299010 | inf |
| 31 | (1.523, 1.572] | 12 | 0.250000 | 0.000044 | 3.0 | 9.0 | 0.000051 | 0.000042 | 0.798082 | 0.041667 | 0.123015 | inf |
| 32 | (1.572, 1.621] | 5 | 0.600000 | 0.000018 | 3.0 | 2.0 | 0.000051 | 0.000009 | 1.871148 | 0.350000 | 1.073065 | inf |
| 33 | (1.621, 1.671] | 2 | 1.000000 | 0.000007 | 2.0 | 0.0 | 0.000034 | 0.000000 | inf | 0.400000 | inf | inf |
| 34 | (1.671, 1.72] | 2 | 0.000000 | 0.000007 | 0.0 | 2.0 | 0.000000 | 0.000009 | 0.000000 | 1.000000 | inf | inf |
| 35 | (1.72, 1.769] | 3 | 0.000000 | 0.000011 | 0.0 | 3.0 | 0.000000 | 0.000014 | 0.000000 | 0.000000 | 0.000000 | inf |
| 36 | (1.769, 1.818] | 3 | 0.333333 | 0.000011 | 1.0 | 2.0 | 0.000017 | 0.000009 | 1.040954 | 0.333333 | 1.040954 | inf |
| 37 | (1.818, 1.867] | 1 | 0.000000 | 0.000004 | 0.0 | 1.0 | 0.000000 | 0.000005 | 0.000000 | 0.333333 | 1.040954 | inf |
| 38 | (1.867, 1.916] | 4 | 0.000000 | 0.000015 | 0.0 | 4.0 | 0.000000 | 0.000019 | 0.000000 | 0.000000 | 0.000000 | inf |
| 39 | (1.916, 1.965] | 3 | 0.333333 | 0.000011 | 1.0 | 2.0 | 0.000017 | 0.000009 | 1.040954 | 0.333333 | 1.040954 | inf |
plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
# Categories: '< 0.05', '0.05 - 0.2', '0.2 - 0.4', '0.4 - 0.7', '0.7 - 1.', '1. - 1.4', '> 1.4'
df_inputs_prepr['total_balance_to_credit_ratio:0-0.05'] = np.where((df_inputs_prepr['total_balance_to_credit_ratio'] <= 0.05), 1, 0)
df_inputs_prepr['total_balance_to_credit_ratio:0.05-0.2'] = np.where((df_inputs_prepr['total_balance_to_credit_ratio'] > 0.05) & (df_inputs_prepr['total_balance_to_credit_ratio'] <= 0.2), 1, 0)
df_inputs_prepr['total_balance_to_credit_ratio:0.2-0.4'] = np.where((df_inputs_prepr['total_balance_to_credit_ratio'] > 0.2) & (df_inputs_prepr['total_balance_to_credit_ratio'] <= 0.4), 1, 0)
df_inputs_prepr['total_balance_to_credit_ratio:0.4-0.7'] = np.where((df_inputs_prepr['total_balance_to_credit_ratio'] > 0.4) & (df_inputs_prepr['total_balance_to_credit_ratio'] <= 0.7), 1, 0)
df_inputs_prepr['total_balance_to_credit_ratio:0.7-1'] = np.where((df_inputs_prepr['total_balance_to_credit_ratio'] > 0.7) & (df_inputs_prepr['total_balance_to_credit_ratio'] <= 1.), 1, 0)
df_inputs_prepr['total_balance_to_credit_ratio:1-1.4'] = np.where((df_inputs_prepr['total_balance_to_credit_ratio'] > 1.) & (df_inputs_prepr['total_balance_to_credit_ratio'] <= 1.4), 1, 0)
df_inputs_prepr['total_balance_to_credit_ratio:>1.4'] = np.where((df_inputs_prepr['total_balance_to_credit_ratio'] > 1.4), 1, 0)
Variable: 'rev_to_il_limit_ratio'¶
# unique values
df_inputs_prepr['rev_to_il_limit_ratio'].nunique()
215869
# 'rev_to_il_limit_ratio'
# the categories of everyone with 'rev_to_il_limit_ratio' less or equal 2.
df_inputs_prepr_temp = df_inputs_prepr.loc[df_inputs_prepr['rev_to_il_limit_ratio'] <= 10., : ]
# Here we do fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.
#df_inputs_prepr_temp
df_inputs_prepr_temp['rev_to_il_limit_ratio_factor'] = pd.cut(df_inputs_prepr_temp['rev_to_il_limit_ratio'], 50)
# Here we do fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.
df_temp = woe_ordered_continuous(df_inputs_prepr_temp, 'rev_to_il_limit_ratio_factor', df_targets_prepr[df_inputs_prepr_temp.index])
# We calculate weight of evidence.
df_temp
C:\Users\pc\AppData\Local\Temp\ipykernel_6728\2198525158.py:7: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy df_inputs_prepr_temp['rev_to_il_limit_ratio_factor'] = pd.cut(df_inputs_prepr_temp['rev_to_il_limit_ratio'], 50) C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1025164655.py:4: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning. df = pd.concat([df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].count(), C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1025164655.py:5: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning. df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].mean()], axis = 1)
| rev_to_il_limit_ratio_factor | n_obs | prop_good | prop_n_obs | n_good | n_bad | prop_n_good | prop_n_bad | WoE | diff_prop_good | diff_WoE | IV | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | (-0.01, 0.2] | 27768 | 0.245534 | 0.115475 | 6818.0 | 20950.0 | 0.130661 | 0.111267 | 0.776706 | NaN | NaN | 0.005387 |
| 1 | (0.2, 0.4] | 42419 | 0.233575 | 0.176403 | 9908.0 | 32511.0 | 0.189878 | 0.172668 | 0.741779 | 0.011960 | 0.034927 | 0.005387 |
| 2 | (0.4, 0.6] | 34828 | 0.224302 | 0.144835 | 7812.0 | 27016.0 | 0.149710 | 0.143484 | 0.714610 | 0.009272 | 0.027169 | 0.005387 |
| 3 | (0.6, 0.8] | 39273 | 0.200392 | 0.163320 | 7870.0 | 31403.0 | 0.150821 | 0.166784 | 0.644111 | 0.023910 | 0.070499 | 0.005387 |
| 4 | (0.8, 1.0] | 18747 | 0.212834 | 0.077961 | 3990.0 | 14757.0 | 0.076465 | 0.078375 | 0.680882 | 0.012442 | 0.036771 | 0.005387 |
| 5 | (1.0, 1.2] | 13893 | 0.217520 | 0.057775 | 3022.0 | 10871.0 | 0.057914 | 0.057737 | 0.694680 | 0.004686 | 0.013798 | 0.005387 |
| 6 | (1.2, 1.4] | 10535 | 0.216421 | 0.043811 | 2280.0 | 8255.0 | 0.043694 | 0.043843 | 0.691449 | 0.001098 | 0.003232 | 0.005387 |
| 7 | (1.4, 1.6] | 8102 | 0.198470 | 0.033693 | 1608.0 | 6494.0 | 0.030816 | 0.034490 | 0.638410 | 0.017952 | 0.053038 | 0.005387 |
| 8 | (1.6, 1.8] | 6542 | 0.201009 | 0.027205 | 1315.0 | 5227.0 | 0.025201 | 0.027761 | 0.645938 | 0.002539 | 0.007528 | 0.005387 |
| 9 | (1.8, 2.0] | 5332 | 0.198612 | 0.022174 | 1059.0 | 4273.0 | 0.020295 | 0.022694 | 0.638834 | 0.002397 | 0.007105 | 0.005387 |
| 10 | (2.0, 2.2] | 4284 | 0.207049 | 0.017815 | 887.0 | 3397.0 | 0.016999 | 0.018042 | 0.663811 | 0.008437 | 0.024977 | 0.005387 |
| 11 | (2.2, 2.4] | 3594 | 0.200612 | 0.014946 | 721.0 | 2873.0 | 0.013817 | 0.015259 | 0.644763 | 0.006437 | 0.019048 | 0.005387 |
| 12 | (2.4, 2.6] | 2962 | 0.194801 | 0.012318 | 577.0 | 2385.0 | 0.011058 | 0.012667 | 0.627519 | 0.005811 | 0.017244 | 0.005387 |
| 13 | (2.6, 2.8] | 2597 | 0.198691 | 0.010800 | 516.0 | 2081.0 | 0.009889 | 0.011052 | 0.639067 | 0.003890 | 0.011548 | 0.005387 |
| 14 | (2.8, 3.0] | 2204 | 0.195554 | 0.009165 | 431.0 | 1773.0 | 0.008260 | 0.009417 | 0.629755 | 0.003137 | 0.009312 | 0.005387 |
| 15 | (3.0, 3.2] | 1923 | 0.197608 | 0.007997 | 380.0 | 1543.0 | 0.007282 | 0.008195 | 0.635854 | 0.002054 | 0.006099 | 0.005387 |
| 16 | (3.2, 3.4] | 1721 | 0.202789 | 0.007157 | 349.0 | 1372.0 | 0.006688 | 0.007287 | 0.651211 | 0.005181 | 0.015356 | 0.005387 |
| 17 | (3.4, 3.6] | 1404 | 0.175214 | 0.005839 | 246.0 | 1158.0 | 0.004714 | 0.006150 | 0.569020 | 0.027575 | 0.082190 | 0.005387 |
| 18 | (3.6, 3.8] | 1257 | 0.183771 | 0.005227 | 231.0 | 1026.0 | 0.004427 | 0.005449 | 0.594652 | 0.008557 | 0.025632 | 0.005387 |
| 19 | (3.8, 4.0] | 1087 | 0.226311 | 0.004520 | 246.0 | 841.0 | 0.004714 | 0.004467 | 0.720503 | 0.042540 | 0.125851 | 0.005387 |
| 20 | (4.0, 4.2] | 967 | 0.193382 | 0.004021 | 187.0 | 780.0 | 0.003584 | 0.004143 | 0.623300 | 0.032929 | 0.097203 | 0.005387 |
| 21 | (4.2, 4.4] | 814 | 0.189189 | 0.003385 | 154.0 | 660.0 | 0.002951 | 0.003505 | 0.610821 | 0.004192 | 0.012479 | 0.005387 |
| 22 | (4.4, 4.6] | 777 | 0.199485 | 0.003231 | 155.0 | 622.0 | 0.002970 | 0.003303 | 0.641423 | 0.010296 | 0.030602 | 0.005387 |
| 23 | (4.6, 4.8] | 657 | 0.203957 | 0.002732 | 134.0 | 523.0 | 0.002568 | 0.002778 | 0.654668 | 0.004472 | 0.013246 | 0.005387 |
| 24 | (4.8, 5.0] | 638 | 0.167712 | 0.002653 | 107.0 | 531.0 | 0.002051 | 0.002820 | 0.546444 | 0.036246 | 0.108224 | 0.005387 |
| 25 | (5.0, 5.2] | 565 | 0.221239 | 0.002350 | 125.0 | 440.0 | 0.002396 | 0.002337 | 0.705615 | 0.053527 | 0.159171 | 0.005387 |
| 26 | (5.2, 5.4] | 517 | 0.185687 | 0.002150 | 96.0 | 421.0 | 0.001840 | 0.002236 | 0.600374 | 0.035552 | 0.105241 | 0.005387 |
| 27 | (5.4, 5.6] | 454 | 0.162996 | 0.001888 | 74.0 | 380.0 | 0.001418 | 0.002018 | 0.532200 | 0.022691 | 0.068174 | 0.005387 |
| 28 | (5.6, 5.8] | 422 | 0.156398 | 0.001755 | 66.0 | 356.0 | 0.001265 | 0.001891 | 0.512200 | 0.006597 | 0.020000 | 0.005387 |
| 29 | (5.8, 6.0] | 375 | 0.192000 | 0.001559 | 72.0 | 303.0 | 0.001380 | 0.001609 | 0.619190 | 0.035602 | 0.106990 | 0.005387 |
| 30 | (6.0, 6.2] | 384 | 0.208333 | 0.001597 | 80.0 | 304.0 | 0.001533 | 0.001615 | 0.667603 | 0.016333 | 0.048413 | 0.005387 |
| 31 | (6.2, 6.4] | 352 | 0.210227 | 0.001464 | 74.0 | 278.0 | 0.001418 | 0.001476 | 0.673194 | 0.001894 | 0.005591 | 0.005387 |
| 32 | (6.4, 6.6] | 281 | 0.192171 | 0.001169 | 54.0 | 227.0 | 0.001035 | 0.001206 | 0.619699 | 0.018056 | 0.053495 | 0.005387 |
| 33 | (6.6, 6.8] | 265 | 0.218868 | 0.001102 | 58.0 | 207.0 | 0.001112 | 0.001099 | 0.698646 | 0.026697 | 0.078947 | 0.005387 |
| 34 | (6.8, 7.0] | 240 | 0.225000 | 0.000998 | 54.0 | 186.0 | 0.001035 | 0.000988 | 0.716658 | 0.006132 | 0.018012 | 0.005387 |
| 35 | (7.0, 7.2] | 218 | 0.211009 | 0.000907 | 46.0 | 172.0 | 0.000882 | 0.000914 | 0.675501 | 0.013991 | 0.041157 | 0.005387 |
| 36 | (7.2, 7.4] | 212 | 0.179245 | 0.000882 | 38.0 | 174.0 | 0.000728 | 0.000924 | 0.581112 | 0.031764 | 0.094389 | 0.005387 |
| 37 | (7.4, 7.6] | 195 | 0.200000 | 0.000811 | 39.0 | 156.0 | 0.000747 | 0.000829 | 0.642949 | 0.020755 | 0.061837 | 0.005387 |
| 38 | (7.6, 7.8] | 203 | 0.177340 | 0.000844 | 36.0 | 167.0 | 0.000690 | 0.000887 | 0.575401 | 0.022660 | 0.067548 | 0.005387 |
| 39 | (7.8, 8.0] | 196 | 0.117347 | 0.000815 | 23.0 | 173.0 | 0.000441 | 0.000919 | 0.391853 | 0.059993 | 0.183548 | 0.005387 |
| 40 | (8.0, 8.2] | 156 | 0.198718 | 0.000649 | 31.0 | 125.0 | 0.000594 | 0.000664 | 0.639147 | 0.081371 | 0.247295 | 0.005387 |
| 41 | (8.2, 8.4] | 151 | 0.178808 | 0.000628 | 27.0 | 124.0 | 0.000517 | 0.000659 | 0.579801 | 0.019910 | 0.059346 | 0.005387 |
| 42 | (8.4, 8.6] | 141 | 0.156028 | 0.000586 | 22.0 | 119.0 | 0.000422 | 0.000632 | 0.511077 | 0.022780 | 0.068725 | 0.005387 |
| 43 | (8.6, 8.8] | 159 | 0.194969 | 0.000661 | 31.0 | 128.0 | 0.000594 | 0.000680 | 0.628017 | 0.038940 | 0.116940 | 0.005387 |
| 44 | (8.8, 9.0] | 123 | 0.154472 | 0.000512 | 19.0 | 104.0 | 0.000364 | 0.000552 | 0.506344 | 0.040497 | 0.121674 | 0.005387 |
| 45 | (9.0, 9.2] | 101 | 0.217822 | 0.000420 | 22.0 | 79.0 | 0.000422 | 0.000420 | 0.695569 | 0.063350 | 0.189226 | 0.005387 |
| 46 | (9.2, 9.4] | 123 | 0.178862 | 0.000512 | 22.0 | 101.0 | 0.000422 | 0.000536 | 0.579963 | 0.038960 | 0.115607 | 0.005387 |
| 47 | (9.4, 9.6] | 123 | 0.260163 | 0.000512 | 32.0 | 91.0 | 0.000613 | 0.000483 | 0.819278 | 0.081301 | 0.239315 | 0.005387 |
| 48 | (9.6, 9.8] | 106 | 0.207547 | 0.000441 | 22.0 | 84.0 | 0.000422 | 0.000446 | 0.665281 | 0.052615 | 0.153997 | 0.005387 |
| 49 | (9.8, 10.0] | 80 | 0.187500 | 0.000333 | 15.0 | 65.0 | 0.000287 | 0.000345 | 0.605785 | 0.020047 | 0.059496 | 0.005387 |
plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
# Categories: '< 0.6', '0.6 - 0.8', '0.8 - 1.8', '1.8 - 4.5', '4.5 - 10.', '> 10.'
df_inputs_prepr['rev_to_il_limit_ratio:0-0.6'] = np.where((df_inputs_prepr['rev_to_il_limit_ratio'] <= 0.6), 1, 0)
df_inputs_prepr['rev_to_il_limit_ratio:0.6-0.8'] = np.where((df_inputs_prepr['rev_to_il_limit_ratio'] > 0.6) & (df_inputs_prepr['rev_to_il_limit_ratio'] <= 0.8), 1, 0)
df_inputs_prepr['rev_to_il_limit_ratio:0.8-1.8'] = np.where((df_inputs_prepr['rev_to_il_limit_ratio'] > 0.8) & (df_inputs_prepr['rev_to_il_limit_ratio'] <= 1.8), 1, 0)
df_inputs_prepr['rev_to_il_limit_ratio:1.8-4.5'] = np.where((df_inputs_prepr['rev_to_il_limit_ratio'] > 1.8) & (df_inputs_prepr['rev_to_il_limit_ratio'] <= 4.5), 1, 0)
df_inputs_prepr['rev_to_il_limit_ratio:4.5-10'] = np.where((df_inputs_prepr['rev_to_il_limit_ratio'] > 4.5) & (df_inputs_prepr['rev_to_il_limit_ratio'] <= 10.), 1, 0)
df_inputs_prepr['rev_to_il_limit_ratio:>10.'] = np.where((df_inputs_prepr['rev_to_il_limit_ratio'] > 10.), 1, 0)
Variable: 'total_il_high_credit_limit'¶
# unique values
df_inputs_prepr['total_il_high_credit_limit'].nunique()
94012
# 'rev_to_il_limit_ratio'
# the categories of everyone with 'rev_to_il_limit_ratio' less or equal 2.
df_inputs_prepr_temp = df_inputs_prepr.loc[df_inputs_prepr['total_il_high_credit_limit'] <= 250000., : ]
# Here we do fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.
#df_inputs_prepr_temp
df_inputs_prepr_temp['total_il_high_credit_limit_factor'] = pd.cut(df_inputs_prepr_temp['total_il_high_credit_limit'], 50)
# Here we do fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.
df_temp = woe_ordered_continuous(df_inputs_prepr_temp, 'total_il_high_credit_limit_factor', df_targets_prepr[df_inputs_prepr_temp.index])
# We calculate weight of evidence.
df_temp
C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1905072761.py:7: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy df_inputs_prepr_temp['total_il_high_credit_limit_factor'] = pd.cut(df_inputs_prepr_temp['total_il_high_credit_limit'], 50) C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1025164655.py:4: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning. df = pd.concat([df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].count(), C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1025164655.py:5: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning. df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].mean()], axis = 1)
| total_il_high_credit_limit_factor | n_obs | prop_good | prop_n_obs | n_good | n_bad | prop_n_good | prop_n_bad | WoE | diff_prop_good | diff_WoE | IV | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | (-249.996, 4999.92] | 36285 | 0.203087 | 0.132928 | 7369.0 | 28916.0 | 0.125820 | 0.134869 | 0.659021 | NaN | NaN | 0.00276 |
| 1 | (4999.92, 9999.84] | 12562 | 0.230457 | 0.046020 | 2895.0 | 9667.0 | 0.049430 | 0.045089 | 0.740164 | 0.027370 | 0.081143 | 0.00276 |
| 2 | (9999.84, 14999.76] | 16870 | 0.220036 | 0.061802 | 3712.0 | 13158.0 | 0.063379 | 0.061371 | 0.709375 | 0.010421 | 0.030789 | 0.00276 |
| 3 | (14999.76, 19999.68] | 18793 | 0.212792 | 0.068847 | 3999.0 | 14794.0 | 0.068280 | 0.069002 | 0.687900 | 0.007244 | 0.021475 | 0.00276 |
| 4 | (19999.68, 24999.6] | 20033 | 0.219588 | 0.073390 | 4399.0 | 15634.0 | 0.075109 | 0.072920 | 0.708049 | 0.006796 | 0.020149 | 0.00276 |
| 5 | (24999.6, 29999.52] | 19362 | 0.224357 | 0.070931 | 4344.0 | 15018.0 | 0.074170 | 0.070047 | 0.722157 | 0.004769 | 0.014108 | 0.00276 |
| 6 | (29999.52, 34999.44] | 30728 | 0.189827 | 0.112570 | 5833.0 | 24895.0 | 0.099594 | 0.116115 | 0.619349 | 0.034530 | 0.102808 | 0.00276 |
| 7 | (34999.44, 39999.36] | 15447 | 0.224315 | 0.056589 | 3465.0 | 11982.0 | 0.059162 | 0.055886 | 0.722034 | 0.034489 | 0.102685 | 0.00276 |
| 8 | (39999.36, 44999.28] | 13698 | 0.219156 | 0.050182 | 3002.0 | 10696.0 | 0.051257 | 0.049888 | 0.706771 | 0.005159 | 0.015263 | 0.00276 |
| 9 | (44999.28, 49999.2] | 11906 | 0.227448 | 0.043617 | 2708.0 | 9198.0 | 0.046237 | 0.042901 | 0.731288 | 0.008292 | 0.024517 | 0.00276 |
| 10 | (49999.2, 54999.12] | 10240 | 0.223535 | 0.037514 | 2289.0 | 7951.0 | 0.039083 | 0.037085 | 0.719727 | 0.003913 | 0.011560 | 0.00276 |
| 11 | (54999.12, 59999.04] | 8855 | 0.223264 | 0.032440 | 1977.0 | 6878.0 | 0.033756 | 0.032080 | 0.718925 | 0.000271 | 0.000803 | 0.00276 |
| 12 | (59999.04, 64998.96] | 7636 | 0.220272 | 0.027974 | 1682.0 | 5954.0 | 0.028719 | 0.027771 | 0.710076 | 0.002991 | 0.008849 | 0.00276 |
| 13 | (64998.96, 69998.88] | 6648 | 0.213448 | 0.024355 | 1419.0 | 5229.0 | 0.024228 | 0.024389 | 0.689846 | 0.006825 | 0.020229 | 0.00276 |
| 14 | (69998.88, 74998.8] | 5637 | 0.221217 | 0.020651 | 1247.0 | 4390.0 | 0.021291 | 0.020476 | 0.712871 | 0.007769 | 0.023025 | 0.00276 |
| 15 | (74998.8, 79998.72] | 4902 | 0.214810 | 0.017958 | 1053.0 | 3849.0 | 0.017979 | 0.017952 | 0.693890 | 0.006407 | 0.018981 | 0.00276 |
| 16 | (79998.72, 84998.64] | 4278 | 0.218326 | 0.015672 | 934.0 | 3344.0 | 0.015947 | 0.015597 | 0.704313 | 0.003516 | 0.010423 | 0.00276 |
| 17 | (84998.64, 89998.56] | 3714 | 0.222940 | 0.013606 | 828.0 | 2886.0 | 0.014137 | 0.013461 | 0.717968 | 0.004614 | 0.013655 | 0.00276 |
| 18 | (89998.56, 94998.48] | 3089 | 0.220783 | 0.011316 | 682.0 | 2407.0 | 0.011645 | 0.011227 | 0.711588 | 0.002157 | 0.006380 | 0.00276 |
| 19 | (94998.48, 99998.4] | 2642 | 0.228993 | 0.009679 | 605.0 | 2037.0 | 0.010330 | 0.009501 | 0.735847 | 0.008210 | 0.024258 | 0.00276 |
| 20 | (99998.4, 104998.32] | 2358 | 0.224343 | 0.008638 | 529.0 | 1829.0 | 0.009032 | 0.008531 | 0.722114 | 0.004651 | 0.013732 | 0.00276 |
| 21 | (104998.32, 109998.24] | 2078 | 0.207411 | 0.007613 | 431.0 | 1647.0 | 0.007359 | 0.007682 | 0.671904 | 0.016932 | 0.050210 | 0.00276 |
| 22 | (109998.24, 114998.16] | 1749 | 0.205260 | 0.006407 | 359.0 | 1390.0 | 0.006130 | 0.006483 | 0.665499 | 0.002151 | 0.006404 | 0.00276 |
| 23 | (114998.16, 119998.08] | 1516 | 0.207784 | 0.005554 | 315.0 | 1201.0 | 0.005378 | 0.005602 | 0.673013 | 0.002523 | 0.007514 | 0.00276 |
| 24 | (119998.08, 124998.0] | 1332 | 0.209459 | 0.004880 | 279.0 | 1053.0 | 0.004764 | 0.004911 | 0.677998 | 0.001676 | 0.004985 | 0.00276 |
| 25 | (124998.0, 129997.92] | 1208 | 0.217715 | 0.004425 | 263.0 | 945.0 | 0.004491 | 0.004408 | 0.702503 | 0.008256 | 0.024505 | 0.00276 |
| 26 | (129997.92, 134997.84] | 1046 | 0.204589 | 0.003832 | 214.0 | 832.0 | 0.003654 | 0.003881 | 0.663499 | 0.013126 | 0.039003 | 0.00276 |
| 27 | (134997.84, 139997.76] | 922 | 0.233189 | 0.003378 | 215.0 | 707.0 | 0.003671 | 0.003298 | 0.748216 | 0.028600 | 0.084716 | 0.00276 |
| 28 | (139997.76, 144997.68] | 806 | 0.224566 | 0.002953 | 181.0 | 625.0 | 0.003090 | 0.002915 | 0.722774 | 0.008623 | 0.025442 | 0.00276 |
| 29 | (144997.68, 149997.6] | 740 | 0.195946 | 0.002711 | 145.0 | 595.0 | 0.002476 | 0.002775 | 0.637689 | 0.028620 | 0.085084 | 0.00276 |
| 30 | (149997.6, 154997.52] | 678 | 0.228614 | 0.002484 | 155.0 | 523.0 | 0.002646 | 0.002439 | 0.734727 | 0.032668 | 0.097037 | 0.00276 |
| 31 | (154997.52, 159997.44] | 579 | 0.231434 | 0.002121 | 134.0 | 445.0 | 0.002288 | 0.002076 | 0.743043 | 0.002820 | 0.008317 | 0.00276 |
| 32 | (159997.44, 164997.36] | 518 | 0.198842 | 0.001898 | 103.0 | 415.0 | 0.001759 | 0.001936 | 0.646349 | 0.032592 | 0.096694 | 0.00276 |
| 33 | (164997.36, 169997.28] | 465 | 0.212903 | 0.001703 | 99.0 | 366.0 | 0.001690 | 0.001707 | 0.688230 | 0.014062 | 0.041881 | 0.00276 |
| 34 | (169997.28, 174997.2] | 437 | 0.173913 | 0.001601 | 76.0 | 361.0 | 0.001298 | 0.001684 | 0.571360 | 0.038990 | 0.116870 | 0.00276 |
| 35 | (174997.2, 179997.12] | 398 | 0.211055 | 0.001458 | 84.0 | 314.0 | 0.001434 | 0.001465 | 0.682741 | 0.037142 | 0.111381 | 0.00276 |
| 36 | (179997.12, 184997.04] | 339 | 0.182891 | 0.001242 | 62.0 | 277.0 | 0.001059 | 0.001292 | 0.598486 | 0.028164 | 0.084255 | 0.00276 |
| 37 | (184997.04, 189996.96] | 358 | 0.217877 | 0.001312 | 78.0 | 280.0 | 0.001332 | 0.001306 | 0.702982 | 0.034986 | 0.104496 | 0.00276 |
| 38 | (189996.96, 194996.88] | 278 | 0.154676 | 0.001018 | 43.0 | 235.0 | 0.000734 | 0.001096 | 0.512722 | 0.063201 | 0.190260 | 0.00276 |
| 39 | (194996.88, 199996.8] | 278 | 0.241007 | 0.001018 | 67.0 | 211.0 | 0.001144 | 0.000984 | 0.771220 | 0.086331 | 0.258498 | 0.00276 |
| 40 | (199996.8, 204996.72] | 240 | 0.162500 | 0.000879 | 39.0 | 201.0 | 0.000666 | 0.000937 | 0.536660 | 0.078507 | 0.234560 | 0.00276 |
| 41 | (204996.72, 209996.64] | 184 | 0.184783 | 0.000674 | 34.0 | 150.0 | 0.000581 | 0.000700 | 0.604184 | 0.022283 | 0.067524 | 0.00276 |
| 42 | (209996.64, 214996.56] | 202 | 0.198020 | 0.000740 | 40.0 | 162.0 | 0.000683 | 0.000756 | 0.643892 | 0.013237 | 0.039708 | 0.00276 |
| 43 | (214996.56, 219996.48] | 178 | 0.213483 | 0.000652 | 38.0 | 140.0 | 0.000649 | 0.000653 | 0.689952 | 0.015463 | 0.046059 | 0.00276 |
| 44 | (219996.48, 224996.4] | 157 | 0.242038 | 0.000575 | 38.0 | 119.0 | 0.000649 | 0.000555 | 0.774249 | 0.028555 | 0.084298 | 0.00276 |
| 45 | (224996.4, 229996.32] | 123 | 0.186992 | 0.000451 | 23.0 | 100.0 | 0.000393 | 0.000466 | 0.610831 | 0.055046 | 0.163418 | 0.00276 |
| 46 | (229996.32, 234996.24] | 133 | 0.187970 | 0.000487 | 25.0 | 108.0 | 0.000427 | 0.000504 | 0.613771 | 0.000978 | 0.002940 | 0.00276 |
| 47 | (234996.24, 239996.16] | 123 | 0.138211 | 0.000451 | 17.0 | 106.0 | 0.000290 | 0.000494 | 0.461905 | 0.049759 | 0.151866 | 0.00276 |
| 48 | (239996.16, 244996.08] | 126 | 0.166667 | 0.000462 | 21.0 | 105.0 | 0.000359 | 0.000490 | 0.549358 | 0.028455 | 0.087453 | 0.00276 |
| 49 | (244996.08, 249996.0] | 94 | 0.202128 | 0.000344 | 19.0 | 75.0 | 0.000324 | 0.000350 | 0.656160 | 0.035461 | 0.106803 | 0.00276 |
plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
# Categories: '< 5k', '5 - 10k', '10 - 30k', '30 - 35k', '35 - 100k', '> 100k'
df_inputs_prepr['total_il_high_credit_limit:0-5k'] = np.where((df_inputs_prepr['total_il_high_credit_limit'] <= 5000.), 1, 0)
df_inputs_prepr['total_il_high_credit_limit:5-10k'] = np.where((df_inputs_prepr['total_il_high_credit_limit'] > 5000.) & (df_inputs_prepr['total_il_high_credit_limit'] <= 10000.), 1, 0)
df_inputs_prepr['total_il_high_credit_limit:10-30k'] = np.where((df_inputs_prepr['total_il_high_credit_limit'] > 10000.) & (df_inputs_prepr['total_il_high_credit_limit'] <= 30000.), 1, 0)
df_inputs_prepr['total_il_high_credit_limit:30-35k'] = np.where((df_inputs_prepr['total_il_high_credit_limit'] > 30000.) & (df_inputs_prepr['total_il_high_credit_limit'] <= 35000.), 1, 0)
df_inputs_prepr['total_il_high_credit_limit:35-100k'] = np.where((df_inputs_prepr['total_il_high_credit_limit'] > 35000.) & (df_inputs_prepr['total_il_high_credit_limit'] <= 100000.), 1, 0)
df_inputs_prepr['total_il_high_credit_limit:>100k'] = np.where((df_inputs_prepr['total_il_high_credit_limit'] > 100000.), 1, 0)
Variable: 'tot_cur_bal'¶
# unique values
df_inputs_prepr['tot_cur_bal'].nunique()
170012
# 'tot_cur_bal'
# the categories of everyone with 'tot_cur_bal' less or equal 2.
df_inputs_prepr_temp = df_inputs_prepr.loc[df_inputs_prepr['tot_cur_bal'] <= 500000., : ]
# Here we do fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.
#df_inputs_prepr_temp
df_inputs_prepr_temp['tot_cur_bal_factor'] = pd.cut(df_inputs_prepr_temp['tot_cur_bal'], 50)
# Here we do fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.
df_temp = woe_ordered_continuous(df_inputs_prepr_temp, 'tot_cur_bal_factor', df_targets_prepr[df_inputs_prepr_temp.index])
# We calculate weight of evidence.
df_temp
C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1690865902.py:7: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy df_inputs_prepr_temp['tot_cur_bal_factor'] = pd.cut(df_inputs_prepr_temp['tot_cur_bal'], 50) C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1025164655.py:4: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning. df = pd.concat([df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].count(), C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1025164655.py:5: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning. df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].mean()], axis = 1)
| tot_cur_bal_factor | n_obs | prop_good | prop_n_obs | n_good | n_bad | prop_n_good | prop_n_bad | WoE | diff_prop_good | diff_WoE | IV | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | (-499.976, 9999.52] | 16314 | 0.220792 | 0.061271 | 3602.0 | 12712.0 | 0.062425 | 0.060951 | 0.705166 | NaN | NaN | 0.017143 |
| 1 | (9999.52, 19999.04] | 24906 | 0.236409 | 0.093540 | 5888.0 | 19018.0 | 0.102043 | 0.091187 | 0.750969 | 0.015617 | 0.045803 | 0.017143 |
| 2 | (19999.04, 29998.56] | 25333 | 0.249595 | 0.095143 | 6323.0 | 19010.0 | 0.109582 | 0.091149 | 0.789472 | 0.013186 | 0.038503 | 0.017143 |
| 3 | (29998.56, 39998.08] | 20724 | 0.253571 | 0.077833 | 5255.0 | 15469.0 | 0.091073 | 0.074171 | 0.801053 | 0.003975 | 0.011581 | 0.017143 |
| 4 | (39998.08, 49997.6] | 15744 | 0.254764 | 0.059130 | 4011.0 | 11733.0 | 0.069514 | 0.056257 | 0.804527 | 0.001193 | 0.003473 | 0.017143 |
| 5 | (49997.6, 59997.12] | 11657 | 0.259672 | 0.043780 | 3027.0 | 8630.0 | 0.052460 | 0.041379 | 0.818808 | 0.004909 | 0.014282 | 0.017143 |
| 6 | (59997.12, 69996.64] | 8558 | 0.253681 | 0.032141 | 2171.0 | 6387.0 | 0.037625 | 0.030624 | 0.801374 | 0.005992 | 0.017435 | 0.017143 |
| 7 | (69996.64, 79996.16] | 20487 | 0.186264 | 0.076943 | 3816.0 | 16671.0 | 0.066134 | 0.079934 | 0.602872 | 0.067416 | 0.198502 | 0.017143 |
| 8 | (79996.16, 89995.68] | 6084 | 0.236029 | 0.022850 | 1436.0 | 4648.0 | 0.024887 | 0.022286 | 0.749858 | 0.049764 | 0.146985 | 0.017143 |
| 9 | (89995.68, 99995.2] | 5461 | 0.232192 | 0.020510 | 1268.0 | 4193.0 | 0.021975 | 0.020105 | 0.738625 | 0.003837 | 0.011233 | 0.017143 |
| 10 | (99995.2, 109994.72] | 5111 | 0.221287 | 0.019195 | 1131.0 | 3980.0 | 0.019601 | 0.019083 | 0.706623 | 0.010904 | 0.032002 | 0.017143 |
| 11 | (109994.72, 119994.24] | 4733 | 0.213607 | 0.017776 | 1011.0 | 3722.0 | 0.017521 | 0.017846 | 0.684005 | 0.007681 | 0.022618 | 0.017143 |
| 12 | (119994.24, 129993.76] | 4956 | 0.215093 | 0.018613 | 1066.0 | 3890.0 | 0.018475 | 0.018652 | 0.688387 | 0.001486 | 0.004382 | 0.017143 |
| 13 | (129993.76, 139993.28] | 5013 | 0.204867 | 0.018827 | 1027.0 | 3986.0 | 0.017799 | 0.019112 | 0.658184 | 0.010225 | 0.030203 | 0.017143 |
| 14 | (139993.28, 149992.8] | 4984 | 0.200642 | 0.018718 | 1000.0 | 3984.0 | 0.017331 | 0.019102 | 0.645664 | 0.004225 | 0.012520 | 0.017143 |
| 15 | (149992.8, 159992.32] | 5130 | 0.203509 | 0.019267 | 1044.0 | 4086.0 | 0.018093 | 0.019591 | 0.654161 | 0.002867 | 0.008497 | 0.017143 |
| 16 | (159992.32, 169991.84] | 5065 | 0.184995 | 0.019023 | 937.0 | 4128.0 | 0.016239 | 0.019793 | 0.599079 | 0.018514 | 0.055082 | 0.017143 |
| 17 | (169991.84, 179991.36] | 4895 | 0.197549 | 0.018384 | 967.0 | 3928.0 | 0.016759 | 0.018834 | 0.636482 | 0.012553 | 0.037403 | 0.017143 |
| 18 | (179991.36, 189990.88] | 4659 | 0.191887 | 0.017498 | 894.0 | 3765.0 | 0.015494 | 0.018052 | 0.619642 | 0.005662 | 0.016840 | 0.017143 |
| 19 | (189990.88, 199990.4] | 4589 | 0.190673 | 0.017235 | 875.0 | 3714.0 | 0.015164 | 0.017808 | 0.616027 | 0.001213 | 0.003615 | 0.017143 |
| 20 | (199990.4, 209989.92] | 4359 | 0.183758 | 0.016371 | 801.0 | 3558.0 | 0.013882 | 0.017060 | 0.595379 | 0.006916 | 0.020648 | 0.017143 |
| 21 | (209989.92, 219989.44] | 4331 | 0.189333 | 0.016266 | 820.0 | 3511.0 | 0.014211 | 0.016834 | 0.612030 | 0.005575 | 0.016651 | 0.017143 |
| 22 | (219989.44, 229988.96] | 3853 | 0.193615 | 0.014471 | 746.0 | 3107.0 | 0.012929 | 0.014897 | 0.624789 | 0.004283 | 0.012759 | 0.017143 |
| 23 | (229988.96, 239988.48] | 3717 | 0.182136 | 0.013960 | 677.0 | 3040.0 | 0.011733 | 0.014576 | 0.590527 | 0.011479 | 0.034262 | 0.017143 |
| 24 | (239988.48, 249988.0] | 3583 | 0.177505 | 0.013457 | 636.0 | 2947.0 | 0.011022 | 0.014130 | 0.576644 | 0.004631 | 0.013883 | 0.017143 |
| 25 | (249988.0, 259987.52] | 3313 | 0.187142 | 0.012443 | 620.0 | 2693.0 | 0.010745 | 0.012912 | 0.605492 | 0.009637 | 0.028848 | 0.017143 |
| 26 | (259987.52, 269987.04] | 3240 | 0.178395 | 0.012169 | 578.0 | 2662.0 | 0.010017 | 0.012764 | 0.579315 | 0.008747 | 0.026177 | 0.017143 |
| 27 | (269987.04, 279986.56] | 2970 | 0.176431 | 0.011154 | 524.0 | 2446.0 | 0.009081 | 0.011728 | 0.573419 | 0.001964 | 0.005896 | 0.017143 |
| 28 | (279986.56, 289986.08] | 2807 | 0.178839 | 0.010542 | 502.0 | 2305.0 | 0.008700 | 0.011052 | 0.580645 | 0.002408 | 0.007226 | 0.017143 |
| 29 | (289986.08, 299985.6] | 2627 | 0.176247 | 0.009866 | 463.0 | 2164.0 | 0.008024 | 0.010376 | 0.572866 | 0.002592 | 0.007780 | 0.017143 |
| 30 | (299985.6, 309985.12] | 2487 | 0.177724 | 0.009340 | 442.0 | 2045.0 | 0.007660 | 0.009805 | 0.577302 | 0.001477 | 0.004436 | 0.017143 |
| 31 | (309985.12, 319984.64] | 2357 | 0.174374 | 0.008852 | 411.0 | 1946.0 | 0.007123 | 0.009331 | 0.567238 | 0.003350 | 0.010064 | 0.017143 |
| 32 | (319984.64, 329984.16] | 2105 | 0.163420 | 0.007906 | 344.0 | 1761.0 | 0.005962 | 0.008444 | 0.534192 | 0.010954 | 0.033047 | 0.017143 |
| 33 | (329984.16, 339983.68] | 1936 | 0.169421 | 0.007271 | 328.0 | 1608.0 | 0.005684 | 0.007710 | 0.552324 | 0.006001 | 0.018132 | 0.017143 |
| 34 | (339983.68, 349983.2] | 1862 | 0.178840 | 0.006993 | 333.0 | 1529.0 | 0.005771 | 0.007331 | 0.580649 | 0.009418 | 0.028326 | 0.017143 |
| 35 | (349983.2, 359982.72] | 1731 | 0.160601 | 0.006501 | 278.0 | 1453.0 | 0.004818 | 0.006967 | 0.525648 | 0.018239 | 0.055001 | 0.017143 |
| 36 | (359982.72, 369982.24] | 1626 | 0.170357 | 0.006107 | 277.0 | 1349.0 | 0.004801 | 0.006468 | 0.555143 | 0.009756 | 0.029495 | 0.017143 |
| 37 | (369982.24, 379981.76] | 1472 | 0.163043 | 0.005528 | 240.0 | 1232.0 | 0.004159 | 0.005907 | 0.533050 | 0.007313 | 0.022093 | 0.017143 |
| 38 | (379981.76, 389981.28] | 1365 | 0.170696 | 0.005127 | 233.0 | 1132.0 | 0.004038 | 0.005428 | 0.556166 | 0.007652 | 0.023116 | 0.017143 |
| 39 | (389981.28, 399980.8] | 1335 | 0.168539 | 0.005014 | 225.0 | 1110.0 | 0.003899 | 0.005322 | 0.549662 | 0.002157 | 0.006503 | 0.017143 |
| 40 | (399980.8, 409980.32] | 1245 | 0.167871 | 0.004676 | 209.0 | 1036.0 | 0.003622 | 0.004967 | 0.547647 | 0.000668 | 0.002016 | 0.017143 |
| 41 | (409980.32, 419979.84] | 1136 | 0.161092 | 0.004266 | 183.0 | 953.0 | 0.003172 | 0.004569 | 0.527136 | 0.006780 | 0.020510 | 0.017143 |
| 42 | (419979.84, 429979.36] | 1030 | 0.170874 | 0.003868 | 176.0 | 854.0 | 0.003050 | 0.004095 | 0.556702 | 0.009782 | 0.029565 | 0.017143 |
| 43 | (429979.36, 439978.88] | 979 | 0.164454 | 0.003677 | 161.0 | 818.0 | 0.002790 | 0.003922 | 0.537318 | 0.006420 | 0.019384 | 0.017143 |
| 44 | (439978.88, 449978.4] | 909 | 0.155116 | 0.003414 | 141.0 | 768.0 | 0.002444 | 0.003682 | 0.508983 | 0.009338 | 0.028335 | 0.017143 |
| 45 | (449978.4, 459977.92] | 828 | 0.166667 | 0.003110 | 138.0 | 690.0 | 0.002392 | 0.003308 | 0.544008 | 0.011551 | 0.035025 | 0.017143 |
| 46 | (459977.92, 469977.44] | 757 | 0.163804 | 0.002843 | 124.0 | 633.0 | 0.002149 | 0.003035 | 0.535354 | 0.002862 | 0.008654 | 0.017143 |
| 47 | (469977.44, 479976.96] | 674 | 0.173591 | 0.002531 | 117.0 | 557.0 | 0.002028 | 0.002671 | 0.564881 | 0.009786 | 0.029527 | 0.017143 |
| 48 | (479976.96, 489976.48] | 617 | 0.176661 | 0.002317 | 109.0 | 508.0 | 0.001889 | 0.002436 | 0.574111 | 0.003071 | 0.009230 | 0.017143 |
| 49 | (489976.48, 499976.0] | 607 | 0.191104 | 0.002280 | 116.0 | 491.0 | 0.002010 | 0.002354 | 0.617310 | 0.014443 | 0.043199 | 0.017143 |
plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
# Categories: '< 20k', '20 - 70k', '70 - 80k', '80 - 130k', '130 - 200k', '200 - 250k', '250 - 500k', '> 500k'
df_inputs_prepr['tot_cur_bal:0-20k'] = np.where((df_inputs_prepr['tot_cur_bal'] <= 20000.), 1, 0)
df_inputs_prepr['tot_cur_bal:20-70k'] = np.where((df_inputs_prepr['tot_cur_bal'] > 20000.) & (df_inputs_prepr['tot_cur_bal'] <= 70000.), 1, 0)
df_inputs_prepr['tot_cur_bal:70-80k'] = np.where((df_inputs_prepr['tot_cur_bal'] > 70000.) & (df_inputs_prepr['tot_cur_bal'] <= 80000.), 1, 0)
df_inputs_prepr['tot_cur_bal:80-130k'] = np.where((df_inputs_prepr['tot_cur_bal'] > 80000.) & (df_inputs_prepr['tot_cur_bal'] <= 130000.), 1, 0)
df_inputs_prepr['tot_cur_bal:130-200k'] = np.where((df_inputs_prepr['tot_cur_bal'] > 130000.) & (df_inputs_prepr['tot_cur_bal'] <= 200000.), 1, 0)
df_inputs_prepr['tot_cur_bal:200-250k'] = np.where((df_inputs_prepr['tot_cur_bal'] > 200000.) & (df_inputs_prepr['tot_cur_bal'] <= 250000.), 1, 0)
df_inputs_prepr['tot_cur_bal:250-500k'] = np.where((df_inputs_prepr['tot_cur_bal'] > 250000.) & (df_inputs_prepr['tot_cur_bal'] <= 500000.), 1, 0)
df_inputs_prepr['tot_cur_bal:>500k'] = np.where((df_inputs_prepr['tot_cur_bal'] > 500000.), 1, 0)
Variable: 'open_act_il'¶
# unique values
df_inputs_prepr['open_act_il'].unique()
array([ 0., 4., 1., 5., 2., 3., 8., 9., 7., 12., 6., 10., 23.,
14., 15., 19., 16., 11., 18., 21., 13., 17., 22., 31., 27., 30.,
20., 25., 26., 24., 32., 35., 42., 29., 28., 53., 36., 45., 40.,
34., 37., 33.])
# 'open_act_il'
df_temp = woe_ordered_continuous(df_inputs_prepr, 'open_act_il', df_targets_prepr)
# We calculate weight of evidence.
df_temp
| open_act_il | n_obs | prop_good | prop_n_obs | n_good | n_bad | prop_n_good | prop_n_bad | WoE | diff_prop_good | diff_WoE | IV | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 173897 | 0.190676 | 0.634119 | 33158.0 | 140739.0 | 0.563892 | 0.653287 | 0.622275 | NaN | NaN | inf |
| 1 | 1.0 | 28616 | 0.247030 | 0.104349 | 7069.0 | 21547.0 | 0.120217 | 0.100018 | 0.789347 | 0.056354 | 0.167072 | inf |
| 2 | 2.0 | 27842 | 0.254939 | 0.101526 | 7098.0 | 20744.0 | 0.120710 | 0.096290 | 0.812532 | 0.007909 | 0.023185 | inf |
| 3 | 3.0 | 17470 | 0.258558 | 0.063705 | 4517.0 | 12953.0 | 0.076817 | 0.060126 | 0.823126 | 0.003619 | 0.010594 | inf |
| 4 | 4.0 | 9205 | 0.269093 | 0.033566 | 2477.0 | 6728.0 | 0.042124 | 0.031230 | 0.853919 | 0.010535 | 0.030793 | inf |
| 5 | 5.0 | 5079 | 0.273479 | 0.018521 | 1389.0 | 3690.0 | 0.023622 | 0.017128 | 0.866720 | 0.004386 | 0.012801 | inf |
| 6 | 6.0 | 2911 | 0.263483 | 0.010615 | 767.0 | 2144.0 | 0.013044 | 0.009952 | 0.837531 | 0.009996 | 0.029188 | inf |
| 7 | 7.0 | 2033 | 0.250861 | 0.007413 | 510.0 | 1523.0 | 0.008673 | 0.007070 | 0.800584 | 0.012623 | 0.036947 | inf |
| 8 | 8.0 | 1454 | 0.240028 | 0.005302 | 349.0 | 1105.0 | 0.005935 | 0.005129 | 0.768778 | 0.010833 | 0.031807 | inf |
| 9 | 9.0 | 1148 | 0.234321 | 0.004186 | 269.0 | 879.0 | 0.004575 | 0.004080 | 0.751980 | 0.005707 | 0.016797 | inf |
| 10 | 10.0 | 948 | 0.247890 | 0.003457 | 235.0 | 713.0 | 0.003996 | 0.003310 | 0.791872 | 0.013570 | 0.039892 | inf |
| 11 | 11.0 | 783 | 0.246488 | 0.002855 | 193.0 | 590.0 | 0.003282 | 0.002739 | 0.787757 | 0.001402 | 0.004115 | inf |
| 12 | 12.0 | 597 | 0.251256 | 0.002177 | 150.0 | 447.0 | 0.002551 | 0.002075 | 0.801743 | 0.004768 | 0.013987 | inf |
| 13 | 13.0 | 461 | 0.271150 | 0.001681 | 125.0 | 336.0 | 0.002126 | 0.001560 | 0.859923 | 0.019893 | 0.058179 | inf |
| 14 | 14.0 | 361 | 0.293629 | 0.001316 | 106.0 | 255.0 | 0.001803 | 0.001184 | 0.925426 | 0.022479 | 0.065504 | inf |
| 15 | 15.0 | 330 | 0.287879 | 0.001203 | 95.0 | 235.0 | 0.001616 | 0.001091 | 0.908688 | 0.005750 | 0.016739 | inf |
| 16 | 16.0 | 210 | 0.204762 | 0.000766 | 43.0 | 167.0 | 0.000731 | 0.000775 | 0.664410 | 0.083117 | 0.244277 | inf |
| 17 | 17.0 | 210 | 0.261905 | 0.000766 | 55.0 | 155.0 | 0.000935 | 0.000719 | 0.832917 | 0.057143 | 0.168506 | inf |
| 18 | 18.0 | 160 | 0.318750 | 0.000583 | 51.0 | 109.0 | 0.000867 | 0.000506 | 0.998498 | 0.056845 | 0.165581 | inf |
| 19 | 19.0 | 110 | 0.263636 | 0.000401 | 29.0 | 81.0 | 0.000493 | 0.000376 | 0.837979 | 0.055114 | 0.160519 | inf |
| 20 | 20.0 | 91 | 0.219780 | 0.000332 | 20.0 | 71.0 | 0.000340 | 0.000330 | 0.709032 | 0.043856 | 0.128946 | inf |
| 21 | 21.0 | 80 | 0.287500 | 0.000292 | 23.0 | 57.0 | 0.000391 | 0.000265 | 0.907585 | 0.067720 | 0.198552 | inf |
| 22 | 22.0 | 54 | 0.277778 | 0.000197 | 15.0 | 39.0 | 0.000255 | 0.000181 | 0.879257 | 0.009722 | 0.028327 | inf |
| 23 | 23.0 | 43 | 0.348837 | 0.000157 | 15.0 | 28.0 | 0.000255 | 0.000130 | 1.086097 | 0.071059 | 0.206840 | inf |
| 24 | 24.0 | 39 | 0.410256 | 0.000142 | 16.0 | 23.0 | 0.000272 | 0.000107 | 1.266567 | 0.061419 | 0.180470 | inf |
| 25 | 25.0 | 21 | 0.142857 | 0.000077 | 3.0 | 18.0 | 0.000051 | 0.000084 | 0.476616 | 0.267399 | 0.789952 | inf |
| 26 | 26.0 | 23 | 0.217391 | 0.000084 | 5.0 | 18.0 | 0.000085 | 0.000084 | 0.701953 | 0.074534 | 0.225338 | inf |
| 27 | 27.0 | 15 | 0.400000 | 0.000055 | 6.0 | 9.0 | 0.000102 | 0.000042 | 1.236185 | 0.182609 | 0.534232 | inf |
| 28 | 28.0 | 7 | 0.285714 | 0.000026 | 2.0 | 5.0 | 0.000034 | 0.000023 | 0.902384 | 0.114286 | 0.333801 | inf |
| 29 | 29.0 | 10 | 0.300000 | 0.000036 | 3.0 | 7.0 | 0.000051 | 0.000032 | 0.943965 | 0.014286 | 0.041580 | inf |
| 30 | 30.0 | 7 | 0.285714 | 0.000026 | 2.0 | 5.0 | 0.000034 | 0.000023 | 0.902384 | 0.014286 | 0.041580 | inf |
| 31 | 31.0 | 6 | 0.166667 | 0.000022 | 1.0 | 5.0 | 0.000017 | 0.000023 | 0.549702 | 0.119048 | 0.352682 | inf |
| 32 | 32.0 | 3 | 0.333333 | 0.000011 | 1.0 | 2.0 | 0.000017 | 0.000009 | 1.040928 | 0.166667 | 0.491225 | inf |
| 33 | 33.0 | 1 | 1.000000 | 0.000004 | 1.0 | 0.0 | 0.000017 | 0.000000 | inf | 0.666667 | inf | inf |
| 34 | 34.0 | 1 | 1.000000 | 0.000004 | 1.0 | 0.0 | 0.000017 | 0.000000 | inf | 0.000000 | NaN | inf |
| 35 | 35.0 | 1 | 1.000000 | 0.000004 | 1.0 | 0.0 | 0.000017 | 0.000000 | inf | 0.000000 | NaN | inf |
| 36 | 36.0 | 2 | 0.500000 | 0.000007 | 1.0 | 1.0 | 0.000017 | 0.000005 | 1.539806 | 0.500000 | inf | inf |
| 37 | 37.0 | 1 | 0.000000 | 0.000004 | 0.0 | 1.0 | 0.000000 | 0.000005 | 0.000000 | 0.500000 | 1.539806 | inf |
| 38 | 40.0 | 1 | 1.000000 | 0.000004 | 1.0 | 0.0 | 0.000017 | 0.000000 | inf | 1.000000 | inf | inf |
| 39 | 42.0 | 1 | 0.000000 | 0.000004 | 0.0 | 1.0 | 0.000000 | 0.000005 | 0.000000 | 1.000000 | inf | inf |
| 40 | 45.0 | 1 | 0.000000 | 0.000004 | 0.0 | 1.0 | 0.000000 | 0.000005 | 0.000000 | 0.000000 | 0.000000 | inf |
| 41 | 53.0 | 1 | 0.000000 | 0.000004 | 0.0 | 1.0 | 0.000000 | 0.000005 | 0.000000 | 0.000000 | 0.000000 | inf |
plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
# Categories: '0', '1-5', '6-15', '>=16'
df_inputs_prepr['open_act_il:0'] = np.where((df_inputs_prepr['open_act_il'] == 0), 1, 0)
df_inputs_prepr['open_act_il:1-5'] = np.where((df_inputs_prepr['open_act_il'] >= 1) & (df_inputs_prepr['open_act_il'] <= 5), 1, 0)
df_inputs_prepr['open_act_il:6-15'] = np.where((df_inputs_prepr['open_act_il'] >= 6) & (df_inputs_prepr['open_act_il'] <= 15), 1, 0)
df_inputs_prepr['open_act_il:>=16'] = np.where((df_inputs_prepr['open_act_il'] >= 16), 1, 0)
Variable: 'open_il_12m'¶
# unique values
df_inputs_prepr['open_il_12m'].unique()
array([ 0., 4., 1., 3., 2., 5., 6., 8., 9., 7., 11., 10., 20.,
13., 15.])
# 'open_il_12m'
df_temp = woe_ordered_continuous(df_inputs_prepr, 'open_il_12m', df_targets_prepr)
# We calculate weight of evidence.
df_temp
| open_il_12m | n_obs | prop_good | prop_n_obs | n_good | n_bad | prop_n_good | prop_n_bad | WoE | diff_prop_good | diff_WoE | IV | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 217562 | 0.200306 | 0.793344 | 43579.0 | 173983.0 | 0.741114 | 0.807601 | 0.651113 | NaN | NaN | 0.014419 |
| 1 | 1.0 | 35641 | 0.257681 | 0.129966 | 9184.0 | 26457.0 | 0.156185 | 0.122809 | 0.820560 | 0.057375 | 0.169447 | 0.014419 |
| 2 | 2.0 | 14460 | 0.279391 | 0.052729 | 4040.0 | 10420.0 | 0.068705 | 0.048368 | 0.883961 | 0.021711 | 0.063401 | 0.014419 |
| 3 | 3.0 | 4417 | 0.304279 | 0.016107 | 1344.0 | 3073.0 | 0.022856 | 0.014264 | 0.956411 | 0.024887 | 0.072450 | 0.014419 |
| 4 | 4.0 | 1453 | 0.307639 | 0.005298 | 447.0 | 1006.0 | 0.007602 | 0.004670 | 0.966185 | 0.003360 | 0.009774 | 0.014419 |
| 5 | 5.0 | 474 | 0.291139 | 0.001728 | 138.0 | 336.0 | 0.002347 | 0.001560 | 0.918180 | 0.016500 | 0.048005 | 0.014419 |
| 6 | 6.0 | 153 | 0.307190 | 0.000558 | 47.0 | 106.0 | 0.000799 | 0.000492 | 0.964877 | 0.016050 | 0.046697 | 0.014419 |
| 7 | 7.0 | 41 | 0.219512 | 0.000150 | 9.0 | 32.0 | 0.000153 | 0.000149 | 0.708238 | 0.087677 | 0.256638 | 0.014419 |
| 8 | 8.0 | 14 | 0.428571 | 0.000051 | 6.0 | 8.0 | 0.000102 | 0.000037 | 1.321159 | 0.209059 | 0.612921 | 0.014419 |
| 9 | 9.0 | 8 | 0.375000 | 0.000029 | 3.0 | 5.0 | 0.000051 | 0.000023 | 1.162592 | 0.053571 | 0.158568 | 0.014419 |
| 10 | 10.0 | 4 | 0.750000 | 0.000015 | 3.0 | 1.0 | 0.000051 | 0.000005 | 2.484161 | 0.375000 | 1.321569 | 0.014419 |
| 11 | 11.0 | 4 | 0.500000 | 0.000015 | 2.0 | 2.0 | 0.000034 | 0.000009 | 1.539806 | 0.250000 | 0.944355 | 0.014419 |
| 12 | 13.0 | 1 | 0.000000 | 0.000004 | 0.0 | 1.0 | 0.000000 | 0.000005 | 0.000000 | 0.500000 | 1.539806 | 0.014419 |
| 13 | 15.0 | 1 | 0.000000 | 0.000004 | 0.0 | 1.0 | 0.000000 | 0.000005 | 0.000000 | 0.000000 | 0.000000 | 0.014419 |
| 14 | 20.0 | 1 | 0.000000 | 0.000004 | 0.0 | 1.0 | 0.000000 | 0.000005 | 0.000000 | 0.000000 | 0.000000 | 0.014419 |
plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
# Categories: '0', '1-5', '>=6'
df_inputs_prepr['open_il_12m:0'] = np.where((df_inputs_prepr['open_il_12m'] == 0), 1, 0)
df_inputs_prepr['open_il_12m:1-5'] = np.where((df_inputs_prepr['open_il_12m'] >= 1) & (df_inputs_prepr['open_il_12m'] <= 5), 1, 0)
df_inputs_prepr['open_il_12m:>=6'] = np.where((df_inputs_prepr['open_il_12m'] >= 6), 1, 0)
Variable: 'num_actv_rev_tl'¶
# unique values
df_inputs_prepr['num_actv_rev_tl'].unique()
array([11., 8., 5., 4., 7., 12., 0., 13., 6., 10., 2., 3., 9.,
1., 20., 14., 23., 16., 19., 15., 17., 25., 22., 18., 24., 26.,
21., 29., 28., 30., 32., 42., 27., 31., 33., 34., 39., 36., 43.,
38., 40., 35., 37., 52.])
# 'open_il_12m'
df_temp = woe_ordered_continuous(df_inputs_prepr, 'num_actv_rev_tl', df_targets_prepr)
# We calculate weight of evidence.
df_temp
| num_actv_rev_tl | n_obs | prop_good | prop_n_obs | n_good | n_bad | prop_n_good | prop_n_bad | WoE | diff_prop_good | diff_WoE | IV | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 14746 | 0.160179 | 0.053772 | 2362.0 | 12384.0 | 0.040169 | 0.057484 | 0.529907 | NaN | NaN | 0.017707 |
| 1 | 1.0 | 9100 | 0.181429 | 0.033183 | 1651.0 | 7449.0 | 0.028077 | 0.034577 | 0.594443 | 0.021250 | 0.064536 | 0.017707 |
| 2 | 2.0 | 24449 | 0.182380 | 0.089154 | 4459.0 | 19990.0 | 0.075831 | 0.092790 | 0.597312 | 0.000951 | 0.002869 | 0.017707 |
| 3 | 3.0 | 36493 | 0.192256 | 0.133072 | 7016.0 | 29477.0 | 0.119316 | 0.136827 | 0.627016 | 0.009876 | 0.029704 | 0.017707 |
| 4 | 4.0 | 40005 | 0.200850 | 0.145879 | 8035.0 | 31970.0 | 0.136645 | 0.148399 | 0.652737 | 0.008594 | 0.025722 | 0.017707 |
| 5 | 5.0 | 36835 | 0.208497 | 0.134320 | 7680.0 | 29155.0 | 0.130608 | 0.135333 | 0.675536 | 0.007647 | 0.022799 | 0.017707 |
| 6 | 6.0 | 30520 | 0.225557 | 0.111292 | 6884.0 | 23636.0 | 0.117071 | 0.109714 | 0.726123 | 0.017060 | 0.050586 | 0.017707 |
| 7 | 7.0 | 23382 | 0.229151 | 0.085263 | 5358.0 | 18024.0 | 0.091119 | 0.083664 | 0.736736 | 0.003594 | 0.010613 | 0.017707 |
| 8 | 8.0 | 17223 | 0.242408 | 0.062804 | 4175.0 | 13048.0 | 0.071001 | 0.060567 | 0.775776 | 0.013258 | 0.039041 | 0.017707 |
| 9 | 9.0 | 12283 | 0.249613 | 0.044790 | 3066.0 | 9217.0 | 0.052141 | 0.042784 | 0.796926 | 0.007205 | 0.021150 | 0.017707 |
| 10 | 10.0 | 8827 | 0.254220 | 0.032188 | 2244.0 | 6583.0 | 0.038162 | 0.030557 | 0.810428 | 0.004607 | 0.013501 | 0.017707 |
| 11 | 11.0 | 6004 | 0.282811 | 0.021894 | 1698.0 | 4306.0 | 0.028877 | 0.019988 | 0.893928 | 0.028591 | 0.083500 | 0.017707 |
| 12 | 12.0 | 4144 | 0.271477 | 0.015111 | 1125.0 | 3019.0 | 0.019132 | 0.014014 | 0.860878 | 0.011335 | 0.033050 | 0.017707 |
| 13 | 13.0 | 2961 | 0.285714 | 0.010797 | 846.0 | 2115.0 | 0.014387 | 0.009817 | 0.902384 | 0.014237 | 0.041507 | 0.017707 |
| 14 | 14.0 | 2012 | 0.296223 | 0.007337 | 596.0 | 1416.0 | 0.010136 | 0.006573 | 0.932975 | 0.010508 | 0.030591 | 0.017707 |
| 15 | 15.0 | 1506 | 0.291501 | 0.005492 | 439.0 | 1067.0 | 0.007466 | 0.004953 | 0.919232 | 0.004722 | 0.013742 | 0.017707 |
| 16 | 16.0 | 1100 | 0.307273 | 0.004011 | 338.0 | 762.0 | 0.005748 | 0.003537 | 0.965119 | 0.015772 | 0.045887 | 0.017707 |
| 17 | 17.0 | 749 | 0.308411 | 0.002731 | 231.0 | 518.0 | 0.003928 | 0.002404 | 0.968430 | 0.001138 | 0.003311 | 0.017707 |
| 18 | 18.0 | 482 | 0.336100 | 0.001758 | 162.0 | 320.0 | 0.002755 | 0.001485 | 1.048981 | 0.027688 | 0.080551 | 0.017707 |
| 19 | 19.0 | 376 | 0.260638 | 0.001371 | 98.0 | 278.0 | 0.001667 | 0.001290 | 0.829213 | 0.075461 | 0.219768 | 0.017707 |
| 20 | 20.0 | 272 | 0.294118 | 0.000992 | 80.0 | 192.0 | 0.001360 | 0.000891 | 0.926849 | 0.033479 | 0.097636 | 0.017707 |
| 21 | 21.0 | 191 | 0.335079 | 0.000696 | 64.0 | 127.0 | 0.001088 | 0.000590 | 1.046008 | 0.040961 | 0.119159 | 0.017707 |
| 22 | 22.0 | 147 | 0.299320 | 0.000536 | 44.0 | 103.0 | 0.000748 | 0.000478 | 0.941985 | 0.035759 | 0.104023 | 0.017707 |
| 23 | 23.0 | 112 | 0.401786 | 0.000408 | 45.0 | 67.0 | 0.000765 | 0.000311 | 1.241466 | 0.102466 | 0.299481 | 0.017707 |
| 24 | 24.0 | 75 | 0.386667 | 0.000273 | 29.0 | 46.0 | 0.000493 | 0.000214 | 1.196862 | 0.015119 | 0.044604 | 0.017707 |
| 25 | 25.0 | 57 | 0.298246 | 0.000208 | 17.0 | 40.0 | 0.000289 | 0.000186 | 0.938861 | 0.088421 | 0.258001 | 0.017707 |
| 26 | 26.0 | 45 | 0.400000 | 0.000164 | 18.0 | 27.0 | 0.000306 | 0.000125 | 1.236185 | 0.101754 | 0.297325 | 0.017707 |
| 27 | 27.0 | 36 | 0.250000 | 0.000131 | 9.0 | 27.0 | 0.000153 | 0.000125 | 0.798060 | 0.150000 | 0.438125 | 0.017707 |
| 28 | 28.0 | 23 | 0.434783 | 0.000084 | 10.0 | 13.0 | 0.000170 | 0.000060 | 1.339784 | 0.184783 | 0.541724 | 0.017707 |
| 29 | 29.0 | 24 | 0.250000 | 0.000088 | 6.0 | 18.0 | 0.000102 | 0.000084 | 0.798060 | 0.184783 | 0.541724 | 0.017707 |
| 30 | 30.0 | 13 | 0.230769 | 0.000047 | 3.0 | 10.0 | 0.000051 | 0.000046 | 0.741511 | 0.019231 | 0.056549 | 0.017707 |
| 31 | 31.0 | 11 | 0.363636 | 0.000040 | 4.0 | 7.0 | 0.000068 | 0.000032 | 1.129314 | 0.132867 | 0.387803 | 0.017707 |
| 32 | 32.0 | 7 | 0.285714 | 0.000026 | 2.0 | 5.0 | 0.000034 | 0.000023 | 0.902384 | 0.077922 | 0.226930 | 0.017707 |
| 33 | 33.0 | 3 | 0.333333 | 0.000011 | 1.0 | 2.0 | 0.000017 | 0.000009 | 1.040928 | 0.047619 | 0.138543 | 0.017707 |
| 34 | 34.0 | 4 | 0.500000 | 0.000015 | 2.0 | 2.0 | 0.000034 | 0.000009 | 1.539806 | 0.166667 | 0.498878 | 0.017707 |
| 35 | 35.0 | 1 | 0.000000 | 0.000004 | 0.0 | 1.0 | 0.000000 | 0.000005 | 0.000000 | 0.500000 | 1.539806 | 0.017707 |
| 36 | 36.0 | 4 | 0.500000 | 0.000015 | 2.0 | 2.0 | 0.000034 | 0.000009 | 1.539806 | 0.500000 | 1.539806 | 0.017707 |
| 37 | 37.0 | 1 | 0.000000 | 0.000004 | 0.0 | 1.0 | 0.000000 | 0.000005 | 0.000000 | 0.500000 | 1.539806 | 0.017707 |
| 38 | 38.0 | 2 | 0.500000 | 0.000007 | 1.0 | 1.0 | 0.000017 | 0.000005 | 1.539806 | 0.500000 | 1.539806 | 0.017707 |
| 39 | 39.0 | 3 | 0.000000 | 0.000011 | 0.0 | 3.0 | 0.000000 | 0.000014 | 0.000000 | 0.500000 | 1.539806 | 0.017707 |
| 40 | 40.0 | 2 | 0.500000 | 0.000007 | 1.0 | 1.0 | 0.000017 | 0.000005 | 1.539806 | 0.500000 | 1.539806 | 0.017707 |
| 41 | 42.0 | 2 | 0.500000 | 0.000007 | 1.0 | 1.0 | 0.000017 | 0.000005 | 1.539806 | 0.000000 | 0.000000 | 0.017707 |
| 42 | 43.0 | 1 | 0.000000 | 0.000004 | 0.0 | 1.0 | 0.000000 | 0.000005 | 0.000000 | 0.500000 | 1.539806 | 0.017707 |
| 43 | 52.0 | 1 | 0.000000 | 0.000004 | 0.0 | 1.0 | 0.000000 | 0.000005 | 0.000000 | 0.000000 | 0.000000 | 0.017707 |
plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
# We create the following categories: '0', '1-5', '6-9', '10-13', '14-17', '18-26', '>=27'
# '>=27' will be the reference category
df_inputs_prepr['num_actv_rev_tl:0'] = np.where(df_inputs_prepr['num_actv_rev_tl'].isin([0]), 1, 0)
df_inputs_prepr['num_actv_rev_tl:1-5'] = np.where(df_inputs_prepr['num_actv_rev_tl'].isin(range(1, 6)), 1, 0)
df_inputs_prepr['num_actv_rev_tl:6-9'] = np.where(df_inputs_prepr['num_actv_rev_tl'].isin(range(7, 10)), 1, 0)
df_inputs_prepr['num_actv_rev_tl:10-13'] = np.where(df_inputs_prepr['num_actv_rev_tl'].isin(range(11, 14)), 1, 0)
df_inputs_prepr['num_actv_rev_tl:14-17'] = np.where(df_inputs_prepr['num_actv_rev_tl'].isin(range(14, 18)), 1, 0)
df_inputs_prepr['num_actv_rev_tl:18-26'] = np.where(df_inputs_prepr['num_actv_rev_tl'].isin(range(18, 27)), 1, 0)
df_inputs_prepr['num_actv_rev_tl:>=27'] = np.where(df_inputs_prepr['num_actv_rev_tl'].isin(range(27, 500)), 1, 0)
Variable: 'open_rv_12m'¶
# unique values
df_inputs_prepr['open_rv_12m'].unique()
array([ 0., 2., 3., 1., 5., 6., 4., 10., 7., 8., 9., 11., 12.,
14., 13., 15., 16., 18., 28., 22., 21., 17.])
# 'open_rv_12m'
df_temp = woe_ordered_continuous(df_inputs_prepr, 'open_rv_12m', df_targets_prepr)
# We calculate weight of evidence.
df_temp
| open_rv_12m | n_obs | prop_good | prop_n_obs | n_good | n_bad | prop_n_good | prop_n_bad | WoE | diff_prop_good | diff_WoE | IV | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 198437 | 0.193432 | 0.723605 | 38384.0 | 160053.0 | 0.652767 | 0.742940 | 0.630541 | NaN | NaN | inf |
| 1 | 1.0 | 33646 | 0.248053 | 0.122691 | 8346.0 | 25300.0 | 0.141934 | 0.117438 | 0.792350 | 0.054622 | 0.161809 | inf |
| 2 | 2.0 | 20841 | 0.271292 | 0.075997 | 5654.0 | 15187.0 | 0.096153 | 0.070496 | 0.860339 | 0.023239 | 0.067988 | inf |
| 3 | 3.0 | 10931 | 0.290824 | 0.039860 | 3179.0 | 7752.0 | 0.054063 | 0.035984 | 0.917263 | 0.019532 | 0.056925 | inf |
| 4 | 4.0 | 5258 | 0.304869 | 0.019173 | 1603.0 | 3655.0 | 0.027261 | 0.016966 | 0.958127 | 0.014045 | 0.040864 | inf |
| 5 | 5.0 | 2522 | 0.307692 | 0.009197 | 776.0 | 1746.0 | 0.013197 | 0.008105 | 0.966339 | 0.002824 | 0.008212 | inf |
| 6 | 6.0 | 1247 | 0.331997 | 0.004547 | 414.0 | 833.0 | 0.007041 | 0.003867 | 1.037037 | 0.024304 | 0.070698 | inf |
| 7 | 7.0 | 620 | 0.333871 | 0.002261 | 207.0 | 413.0 | 0.003520 | 0.001917 | 1.042493 | 0.001874 | 0.005455 | inf |
| 8 | 8.0 | 321 | 0.295950 | 0.001171 | 95.0 | 226.0 | 0.001616 | 0.001049 | 0.932182 | 0.037921 | 0.110311 | inf |
| 9 | 9.0 | 161 | 0.310559 | 0.000587 | 50.0 | 111.0 | 0.000850 | 0.000515 | 0.974676 | 0.014609 | 0.042494 | inf |
| 10 | 10.0 | 112 | 0.410714 | 0.000408 | 46.0 | 66.0 | 0.000782 | 0.000306 | 1.267927 | 0.100155 | 0.293251 | inf |
| 11 | 11.0 | 47 | 0.340426 | 0.000171 | 16.0 | 31.0 | 0.000272 | 0.000144 | 1.061580 | 0.070289 | 0.206347 | inf |
| 12 | 12.0 | 32 | 0.437500 | 0.000117 | 14.0 | 18.0 | 0.000238 | 0.000084 | 1.347952 | 0.097074 | 0.286372 | inf |
| 13 | 13.0 | 21 | 0.285714 | 0.000077 | 6.0 | 15.0 | 0.000102 | 0.000070 | 0.902384 | 0.151786 | 0.445568 | inf |
| 14 | 14.0 | 13 | 0.384615 | 0.000047 | 5.0 | 8.0 | 0.000085 | 0.000037 | 1.190828 | 0.098901 | 0.288444 | inf |
| 15 | 15.0 | 12 | 0.083333 | 0.000044 | 1.0 | 11.0 | 0.000017 | 0.000051 | 0.287479 | 0.301282 | 0.903349 | inf |
| 16 | 16.0 | 6 | 0.500000 | 0.000022 | 3.0 | 3.0 | 0.000051 | 0.000014 | 1.539806 | 0.416667 | 1.252327 | inf |
| 17 | 17.0 | 1 | 0.000000 | 0.000004 | 0.0 | 1.0 | 0.000000 | 0.000005 | 0.000000 | 0.500000 | 1.539806 | inf |
| 18 | 18.0 | 3 | 0.666667 | 0.000011 | 2.0 | 1.0 | 0.000034 | 0.000005 | 2.119548 | 0.666667 | 2.119548 | inf |
| 19 | 21.0 | 1 | 1.000000 | 0.000004 | 1.0 | 0.0 | 0.000017 | 0.000000 | inf | 0.333333 | inf | inf |
| 20 | 22.0 | 1 | 0.000000 | 0.000004 | 0.0 | 1.0 | 0.000000 | 0.000005 | 0.000000 | 1.000000 | inf | inf |
| 21 | 28.0 | 1 | 0.000000 | 0.000004 | 0.0 | 1.0 | 0.000000 | 0.000005 | 0.000000 | 0.000000 | 0.000000 | inf |
plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
# We create the following categories: '0', '1-2', '3-5', '6-8', '9-13', '>=14'
# '>=14' will be the reference category
df_inputs_prepr['open_rv_12m:0'] = np.where(df_inputs_prepr['open_rv_12m'].isin([0]), 1, 0)
df_inputs_prepr['open_rv_12m:1-2'] = np.where(df_inputs_prepr['open_rv_12m'].isin(range(1, 3)), 1, 0)
df_inputs_prepr['open_rv_12m:3-5'] = np.where(df_inputs_prepr['open_rv_12m'].isin(range(3, 6)), 1, 0)
df_inputs_prepr['open_rv_12m:6-8'] = np.where(df_inputs_prepr['open_rv_12m'].isin(range(6, 9)), 1, 0)
df_inputs_prepr['open_rv_12m:9-13'] = np.where(df_inputs_prepr['open_rv_12m'].isin(range(9, 14)), 1, 0)
df_inputs_prepr['open_rv_12m:>=14'] = np.where(df_inputs_prepr['open_rv_12m'].isin(range(14, 500)), 1, 0)
Variable: 'num_bc_tl'¶
# unique values
df_inputs_prepr['num_bc_tl'].unique()
array([14., 8., 3., 5., 11., 6., 18., 0., 4., 13., 7., 9., 19.,
10., 15., 12., 16., 2., 17., 1., 37., 22., 23., 29., 26., 27.,
20., 28., 21., 33., 36., 24., 49., 34., 25., 35., 31., 32., 38.,
39., 30., 42., 47., 41., 44., 66., 54., 43., 40., 53., 45., 51.,
46., 48., 56., 61., 60.])
# 'num_bc_tl'
df_temp = woe_ordered_continuous(df_inputs_prepr, 'num_bc_tl', df_targets_prepr)
# We calculate weight of evidence.
df_temp
| num_bc_tl | n_obs | prop_good | prop_n_obs | n_good | n_bad | prop_n_good | prop_n_bad | WoE | diff_prop_good | diff_WoE | IV | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 14229 | 0.162345 | 0.051886 | 2310.0 | 11919.0 | 0.039284 | 0.055326 | 0.536524 | NaN | NaN | inf |
| 1 | 1.0 | 4796 | 0.255421 | 0.017489 | 1225.0 | 3571.0 | 0.020833 | 0.016576 | 0.813946 | 0.093077 | 0.277422 | inf |
| 2 | 2.0 | 12324 | 0.240750 | 0.044940 | 2967.0 | 9357.0 | 0.050457 | 0.043434 | 0.770901 | 0.014671 | 0.043044 | inf |
| 3 | 3.0 | 19557 | 0.227080 | 0.071315 | 4441.0 | 15116.0 | 0.075525 | 0.070166 | 0.730622 | 0.013670 | 0.040280 | inf |
| 4 | 4.0 | 24433 | 0.226743 | 0.089095 | 5540.0 | 18893.0 | 0.094214 | 0.087698 | 0.729625 | 0.000337 | 0.000996 | inf |
| 5 | 5.0 | 26231 | 0.220655 | 0.095652 | 5788.0 | 20443.0 | 0.098432 | 0.094893 | 0.711623 | 0.006088 | 0.018003 | inf |
| 6 | 6.0 | 25902 | 0.221064 | 0.094452 | 5726.0 | 20176.0 | 0.097378 | 0.093654 | 0.712834 | 0.000409 | 0.001211 | inf |
| 7 | 7.0 | 24415 | 0.211960 | 0.089030 | 5175.0 | 19240.0 | 0.088007 | 0.089309 | 0.685833 | 0.009104 | 0.027001 | inf |
| 8 | 8.0 | 22272 | 0.212195 | 0.081215 | 4726.0 | 17546.0 | 0.080371 | 0.081446 | 0.686531 | 0.000235 | 0.000698 | inf |
| 9 | 9.0 | 19202 | 0.208364 | 0.070020 | 4001.0 | 15201.0 | 0.068042 | 0.070561 | 0.675139 | 0.003831 | 0.011392 | inf |
| 10 | 10.0 | 15902 | 0.216011 | 0.057987 | 3435.0 | 12467.0 | 0.058416 | 0.057870 | 0.697859 | 0.007647 | 0.022720 | inf |
| 11 | 11.0 | 13332 | 0.210471 | 0.048615 | 2806.0 | 10526.0 | 0.047719 | 0.048860 | 0.681407 | 0.005540 | 0.016451 | inf |
| 12 | 12.0 | 10881 | 0.202095 | 0.039678 | 2199.0 | 8682.0 | 0.037397 | 0.040300 | 0.656456 | 0.008376 | 0.024951 | inf |
| 13 | 13.0 | 8740 | 0.216362 | 0.031871 | 1891.0 | 6849.0 | 0.032159 | 0.031792 | 0.698900 | 0.014266 | 0.042444 | inf |
| 14 | 14.0 | 7033 | 0.200768 | 0.025646 | 1412.0 | 5621.0 | 0.024013 | 0.026092 | 0.652492 | 0.015594 | 0.046408 | inf |
| 15 | 15.0 | 5433 | 0.214062 | 0.019812 | 1163.0 | 4270.0 | 0.019778 | 0.019821 | 0.692077 | 0.013294 | 0.039585 | inf |
| 16 | 16.0 | 4327 | 0.203605 | 0.015778 | 881.0 | 3446.0 | 0.014982 | 0.015996 | 0.660961 | 0.010457 | 0.031116 | inf |
| 17 | 17.0 | 3390 | 0.202655 | 0.012362 | 687.0 | 2703.0 | 0.011683 | 0.012547 | 0.658126 | 0.000950 | 0.002835 | inf |
| 18 | 18.0 | 2626 | 0.193450 | 0.009576 | 508.0 | 2118.0 | 0.008639 | 0.009831 | 0.630596 | 0.009205 | 0.027529 | inf |
| 19 | 19.0 | 2023 | 0.207118 | 0.007377 | 419.0 | 1604.0 | 0.007126 | 0.007446 | 0.671431 | 0.013668 | 0.040834 | inf |
| 20 | 20.0 | 1643 | 0.210590 | 0.005991 | 346.0 | 1297.0 | 0.005884 | 0.006020 | 0.681762 | 0.003472 | 0.010332 | inf |
| 21 | 21.0 | 1244 | 0.198553 | 0.004536 | 247.0 | 997.0 | 0.004201 | 0.004628 | 0.645874 | 0.012037 | 0.035888 | inf |
| 22 | 22.0 | 993 | 0.204431 | 0.003621 | 203.0 | 790.0 | 0.003452 | 0.003667 | 0.663424 | 0.005878 | 0.017550 | inf |
| 23 | 23.0 | 720 | 0.234722 | 0.002625 | 169.0 | 551.0 | 0.002874 | 0.002558 | 0.753163 | 0.030291 | 0.089740 | inf |
| 24 | 24.0 | 556 | 0.219424 | 0.002027 | 122.0 | 434.0 | 0.002075 | 0.002015 | 0.707979 | 0.015298 | 0.045185 | inf |
| 25 | 25.0 | 423 | 0.191489 | 0.001542 | 81.0 | 342.0 | 0.001378 | 0.001588 | 0.624716 | 0.027935 | 0.083263 | inf |
| 26 | 26.0 | 351 | 0.233618 | 0.001280 | 82.0 | 269.0 | 0.001395 | 0.001249 | 0.749911 | 0.042129 | 0.125195 | inf |
| 27 | 27.0 | 266 | 0.191729 | 0.000970 | 51.0 | 215.0 | 0.000867 | 0.000998 | 0.625436 | 0.041889 | 0.124475 | inf |
| 28 | 28.0 | 227 | 0.167401 | 0.000828 | 38.0 | 189.0 | 0.000646 | 0.000877 | 0.551937 | 0.024328 | 0.073499 | inf |
| 29 | 29.0 | 168 | 0.190476 | 0.000613 | 32.0 | 136.0 | 0.000544 | 0.000631 | 0.621675 | 0.023075 | 0.069737 | inf |
| 30 | 30.0 | 124 | 0.209677 | 0.000452 | 26.0 | 98.0 | 0.000442 | 0.000455 | 0.679047 | 0.019201 | 0.057373 | inf |
| 31 | 31.0 | 93 | 0.258065 | 0.000339 | 24.0 | 69.0 | 0.000408 | 0.000320 | 0.821683 | 0.048387 | 0.142636 | inf |
| 32 | 32.0 | 82 | 0.207317 | 0.000299 | 17.0 | 65.0 | 0.000289 | 0.000302 | 0.672023 | 0.050747 | 0.149661 | inf |
| 33 | 33.0 | 78 | 0.153846 | 0.000284 | 12.0 | 66.0 | 0.000204 | 0.000306 | 0.510500 | 0.053471 | 0.161523 | inf |
| 34 | 34.0 | 37 | 0.216216 | 0.000135 | 8.0 | 29.0 | 0.000136 | 0.000135 | 0.698469 | 0.062370 | 0.187969 | inf |
| 35 | 35.0 | 35 | 0.314286 | 0.000128 | 11.0 | 24.0 | 0.000187 | 0.000111 | 0.985514 | 0.098069 | 0.287045 | inf |
| 36 | 36.0 | 29 | 0.241379 | 0.000106 | 7.0 | 22.0 | 0.000119 | 0.000102 | 0.772752 | 0.072906 | 0.212762 | inf |
| 37 | 37.0 | 18 | 0.111111 | 0.000066 | 2.0 | 16.0 | 0.000034 | 0.000074 | 0.377039 | 0.130268 | 0.395713 | inf |
| 38 | 38.0 | 23 | 0.304348 | 0.000084 | 7.0 | 16.0 | 0.000119 | 0.000074 | 0.956612 | 0.193237 | 0.579573 | inf |
| 39 | 39.0 | 17 | 0.235294 | 0.000062 | 4.0 | 13.0 | 0.000068 | 0.000060 | 0.754848 | 0.069054 | 0.201764 | inf |
| 40 | 40.0 | 8 | 0.250000 | 0.000029 | 2.0 | 6.0 | 0.000034 | 0.000028 | 0.798060 | 0.014706 | 0.043213 | inf |
| 41 | 41.0 | 10 | 0.000000 | 0.000036 | 0.0 | 10.0 | 0.000000 | 0.000046 | 0.000000 | 0.250000 | 0.798060 | inf |
| 42 | 42.0 | 6 | 0.166667 | 0.000022 | 1.0 | 5.0 | 0.000017 | 0.000023 | 0.549702 | 0.166667 | 0.549702 | inf |
| 43 | 43.0 | 5 | 0.200000 | 0.000018 | 1.0 | 4.0 | 0.000017 | 0.000019 | 0.650199 | 0.033333 | 0.100496 | inf |
| 44 | 44.0 | 8 | 0.250000 | 0.000029 | 2.0 | 6.0 | 0.000034 | 0.000028 | 0.798060 | 0.050000 | 0.147862 | inf |
| 45 | 45.0 | 4 | 0.250000 | 0.000015 | 1.0 | 3.0 | 0.000017 | 0.000014 | 0.798060 | 0.000000 | 0.000000 | inf |
| 46 | 46.0 | 2 | 0.000000 | 0.000007 | 0.0 | 2.0 | 0.000000 | 0.000009 | 0.000000 | 0.250000 | 0.798060 | inf |
| 47 | 47.0 | 3 | 0.666667 | 0.000011 | 2.0 | 1.0 | 0.000034 | 0.000005 | 2.119548 | 0.666667 | 2.119548 | inf |
| 48 | 48.0 | 1 | 0.000000 | 0.000004 | 0.0 | 1.0 | 0.000000 | 0.000005 | 0.000000 | 0.666667 | 2.119548 | inf |
| 49 | 49.0 | 2 | 1.000000 | 0.000007 | 2.0 | 0.0 | 0.000034 | 0.000000 | inf | 1.000000 | inf | inf |
| 50 | 51.0 | 2 | 0.500000 | 0.000007 | 1.0 | 1.0 | 0.000017 | 0.000005 | 1.539806 | 0.500000 | inf | inf |
| 51 | 53.0 | 3 | 0.000000 | 0.000011 | 0.0 | 3.0 | 0.000000 | 0.000014 | 0.000000 | 0.500000 | 1.539806 | inf |
| 52 | 54.0 | 1 | 0.000000 | 0.000004 | 0.0 | 1.0 | 0.000000 | 0.000005 | 0.000000 | 0.000000 | 0.000000 | inf |
| 53 | 56.0 | 1 | 1.000000 | 0.000004 | 1.0 | 0.0 | 0.000017 | 0.000000 | inf | 1.000000 | inf | inf |
| 54 | 60.0 | 1 | 0.000000 | 0.000004 | 0.0 | 1.0 | 0.000000 | 0.000005 | 0.000000 | 1.000000 | inf | inf |
| 55 | 61.0 | 1 | 0.000000 | 0.000004 | 0.0 | 1.0 | 0.000000 | 0.000005 | 0.000000 | 0.000000 | 0.000000 | inf |
| 56 | 66.0 | 1 | 0.000000 | 0.000004 | 0.0 | 1.0 | 0.000000 | 0.000005 | 0.000000 | 0.000000 | 0.000000 | inf |
plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
# We create the following categories: '0', '1-5', '6-10', '11-20', '21-32', '>=33'
# '>=33' will be the reference category
df_inputs_prepr['num_bc_tl:0'] = np.where(df_inputs_prepr['num_bc_tl'].isin([0]), 1, 0)
df_inputs_prepr['num_bc_tl:1-5'] = np.where(df_inputs_prepr['num_bc_tl'].isin(range(1, 6)), 1, 0)
df_inputs_prepr['num_bc_tl:6-10'] = np.where(df_inputs_prepr['num_bc_tl'].isin(range(6, 11)), 1, 0)
df_inputs_prepr['num_bc_tl:11-20'] = np.where(df_inputs_prepr['num_bc_tl'].isin(range(11, 21)), 1, 0)
df_inputs_prepr['num_bc_tl:21-32'] = np.where(df_inputs_prepr['num_bc_tl'].isin(range(21, 33)), 1, 0)
df_inputs_prepr['num_bc_tl:>=33'] = np.where(df_inputs_prepr['num_bc_tl'].isin(range(33, 500)), 1, 0)
Variable: 'open_acc_6m'¶
# unique values
df_inputs_prepr['open_acc_6m'].unique()
array([ 0., 3., 1., 2., 4., 12., 5., 6., 9., 7., 8., 11., 14.,
10., 15.])
# 'open_acc_6m'
df_temp = woe_ordered_continuous(df_inputs_prepr, 'open_acc_6m', df_targets_prepr)
# We calculate weight of evidence.
df_temp
| open_acc_6m | n_obs | prop_good | prop_n_obs | n_good | n_bad | prop_n_good | prop_n_bad | WoE | diff_prop_good | diff_WoE | IV | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 207285 | 0.196367 | 0.755869 | 40704.0 | 166581.0 | 0.692221 | 0.773242 | 0.639335 | NaN | NaN | 0.019426 |
| 1 | 1.0 | 35777 | 0.255443 | 0.130462 | 9139.0 | 26638.0 | 0.155420 | 0.123649 | 0.814011 | 0.059076 | 0.174676 | 0.019426 |
| 2 | 2.0 | 18505 | 0.275061 | 0.067479 | 5090.0 | 13415.0 | 0.086562 | 0.062270 | 0.871334 | 0.019617 | 0.057323 | 0.019426 |
| 3 | 3.0 | 7791 | 0.296496 | 0.028410 | 2310.0 | 5481.0 | 0.039284 | 0.025442 | 0.933770 | 0.021435 | 0.062436 | 0.019426 |
| 4 | 4.0 | 3070 | 0.308469 | 0.011195 | 947.0 | 2123.0 | 0.016105 | 0.009855 | 0.968598 | 0.011973 | 0.034828 | 0.019426 |
| 5 | 5.0 | 1058 | 0.341210 | 0.003858 | 361.0 | 697.0 | 0.006139 | 0.003235 | 1.063865 | 0.032741 | 0.095267 | 0.019426 |
| 6 | 6.0 | 467 | 0.336188 | 0.001703 | 157.0 | 310.0 | 0.002670 | 0.001439 | 1.049240 | 0.005021 | 0.014625 | 0.019426 |
| 7 | 7.0 | 142 | 0.359155 | 0.000518 | 51.0 | 91.0 | 0.000867 | 0.000422 | 1.116214 | 0.022966 | 0.066975 | 0.019426 |
| 8 | 8.0 | 65 | 0.230769 | 0.000237 | 15.0 | 50.0 | 0.000255 | 0.000232 | 0.741511 | 0.128386 | 0.374703 | 0.019426 |
| 9 | 9.0 | 44 | 0.386364 | 0.000160 | 17.0 | 27.0 | 0.000289 | 0.000125 | 1.195970 | 0.155594 | 0.454459 | 0.019426 |
| 10 | 10.0 | 11 | 0.363636 | 0.000040 | 4.0 | 7.0 | 0.000068 | 0.000032 | 1.129314 | 0.022727 | 0.066656 | 0.019426 |
| 11 | 11.0 | 9 | 0.333333 | 0.000033 | 3.0 | 6.0 | 0.000051 | 0.000028 | 1.040928 | 0.030303 | 0.088387 | 0.019426 |
| 12 | 12.0 | 5 | 0.400000 | 0.000018 | 2.0 | 3.0 | 0.000034 | 0.000014 | 1.236185 | 0.066667 | 0.195258 | 0.019426 |
| 13 | 14.0 | 4 | 0.500000 | 0.000015 | 2.0 | 2.0 | 0.000034 | 0.000009 | 1.539806 | 0.100000 | 0.303621 | 0.019426 |
| 14 | 15.0 | 1 | 0.000000 | 0.000004 | 0.0 | 1.0 | 0.000000 | 0.000005 | 0.000000 | 0.500000 | 1.539806 | 0.019426 |
plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
# We create the following categories: '0', '1-3', '4-7', '>=8'
# '>=8' will be the reference category
df_inputs_prepr['open_acc_6m:0'] = np.where(df_inputs_prepr['open_acc_6m'].isin([0]), 1, 0)
df_inputs_prepr['open_acc_6m:1-3'] = np.where(df_inputs_prepr['open_acc_6m'].isin(range(1, 4)), 1, 0)
df_inputs_prepr['open_acc_6m:4-7'] = np.where(df_inputs_prepr['open_acc_6m'].isin(range(4, 8)), 1, 0)
df_inputs_prepr['open_acc_6m:>=8'] = np.where(df_inputs_prepr['open_acc_6m'].isin(range(8, 500)), 1, 0)
C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1521767103.py:6: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()` df_inputs_prepr['open_acc_6m:>=8'] = np.where(df_inputs_prepr['open_acc_6m'].isin(range(8, 500)), 1, 0)
Variable: 'acc_open_past_24mths'¶
# unique values
df_inputs_prepr['acc_open_past_24mths'].unique()
array([ 4., 9., 8., 2., 10., 3., 5., 13., 6., 7., 0., 1., 12.,
17., 11., 16., 14., 15., 18., 19., 20., 27., 29., 21., 22., 26.,
25., 24., 23., 31., 33., 34., 30., 32., 28., 39., 40., 36., 41.,
35., 38., 42., 46.])
# 'acc_open_past_24mths'
df_temp = woe_ordered_continuous(df_inputs_prepr, 'acc_open_past_24mths', df_targets_prepr)
# We calculate weight of evidence.
df_temp
| acc_open_past_24mths | n_obs | prop_good | prop_n_obs | n_good | n_bad | prop_n_good | prop_n_bad | WoE | diff_prop_good | diff_WoE | IV | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 19825 | 0.153090 | 0.072292 | 3035.0 | 16790.0 | 0.051614 | 0.077936 | 0.508176 | NaN | NaN | inf |
| 1 | 1.0 | 24052 | 0.160111 | 0.087706 | 3851.0 | 20201.0 | 0.065491 | 0.093770 | 0.529700 | 0.007022 | 0.021524 | inf |
| 2 | 2.0 | 34456 | 0.175238 | 0.125645 | 6038.0 | 28418.0 | 0.102684 | 0.131912 | 0.575729 | 0.015127 | 0.046029 | inf |
| 3 | 3.0 | 38994 | 0.193773 | 0.142192 | 7556.0 | 31438.0 | 0.128499 | 0.145930 | 0.631566 | 0.018535 | 0.055836 | inf |
| 4 | 4.0 | 37269 | 0.210550 | 0.135902 | 7847.0 | 29422.0 | 0.133448 | 0.136572 | 0.681643 | 0.016777 | 0.050078 | inf |
| 5 | 5.0 | 31871 | 0.221989 | 0.116218 | 7075.0 | 24796.0 | 0.120319 | 0.115099 | 0.715570 | 0.011438 | 0.033927 | inf |
| 6 | 6.0 | 25390 | 0.240843 | 0.092585 | 6115.0 | 19275.0 | 0.103993 | 0.089471 | 0.771175 | 0.018854 | 0.055605 | inf |
| 7 | 7.0 | 19210 | 0.250182 | 0.070050 | 4806.0 | 14404.0 | 0.081732 | 0.066861 | 0.798595 | 0.009339 | 0.027420 | inf |
| 8 | 8.0 | 13730 | 0.274144 | 0.050067 | 3764.0 | 9966.0 | 0.064011 | 0.046261 | 0.868660 | 0.023962 | 0.070066 | inf |
| 9 | 9.0 | 9350 | 0.274545 | 0.034095 | 2567.0 | 6783.0 | 0.043655 | 0.031486 | 0.869831 | 0.000401 | 0.001170 | inf |
| 10 | 10.0 | 6548 | 0.295968 | 0.023877 | 1938.0 | 4610.0 | 0.032958 | 0.021399 | 0.932234 | 0.021423 | 0.062403 | inf |
| 11 | 11.0 | 4405 | 0.300795 | 0.016063 | 1325.0 | 3080.0 | 0.022533 | 0.014297 | 0.946276 | 0.004826 | 0.014042 | inf |
| 12 | 12.0 | 2919 | 0.298047 | 0.010644 | 870.0 | 2049.0 | 0.014795 | 0.009511 | 0.938283 | 0.002747 | 0.007992 | inf |
| 13 | 13.0 | 1897 | 0.317343 | 0.006917 | 602.0 | 1295.0 | 0.010238 | 0.006011 | 0.994406 | 0.019296 | 0.056123 | inf |
| 14 | 14.0 | 1344 | 0.308036 | 0.004901 | 414.0 | 930.0 | 0.007041 | 0.004317 | 0.967338 | 0.009307 | 0.027068 | inf |
| 15 | 15.0 | 894 | 0.329978 | 0.003260 | 295.0 | 599.0 | 0.005017 | 0.002780 | 1.031161 | 0.021942 | 0.063823 | inf |
| 16 | 16.0 | 591 | 0.340102 | 0.002155 | 201.0 | 390.0 | 0.003418 | 0.001810 | 1.060636 | 0.010124 | 0.029475 | inf |
| 17 | 17.0 | 399 | 0.293233 | 0.001455 | 117.0 | 282.0 | 0.001990 | 0.001309 | 0.924275 | 0.046868 | 0.136361 | inf |
| 18 | 18.0 | 318 | 0.371069 | 0.001160 | 118.0 | 200.0 | 0.002007 | 0.000928 | 1.151070 | 0.077836 | 0.226795 | inf |
| 19 | 19.0 | 227 | 0.356828 | 0.000828 | 81.0 | 146.0 | 0.001378 | 0.000678 | 1.109418 | 0.014241 | 0.041652 | inf |
| 20 | 20.0 | 144 | 0.381944 | 0.000525 | 55.0 | 89.0 | 0.000935 | 0.000413 | 1.182976 | 0.025116 | 0.073559 | inf |
| 21 | 21.0 | 110 | 0.309091 | 0.000401 | 34.0 | 76.0 | 0.000578 | 0.000353 | 0.970406 | 0.072854 | 0.212570 | inf |
| 22 | 22.0 | 86 | 0.325581 | 0.000314 | 28.0 | 58.0 | 0.000476 | 0.000269 | 1.018369 | 0.016490 | 0.047963 | inf |
| 23 | 23.0 | 54 | 0.259259 | 0.000197 | 14.0 | 40.0 | 0.000238 | 0.000186 | 0.825179 | 0.066322 | 0.193190 | inf |
| 24 | 24.0 | 37 | 0.243243 | 0.000135 | 9.0 | 28.0 | 0.000153 | 0.000130 | 0.778229 | 0.016016 | 0.046950 | inf |
| 25 | 25.0 | 28 | 0.500000 | 0.000102 | 14.0 | 14.0 | 0.000238 | 0.000065 | 1.539806 | 0.256757 | 0.761577 | inf |
| 26 | 26.0 | 21 | 0.476190 | 0.000077 | 10.0 | 11.0 | 0.000170 | 0.000051 | 1.465711 | 0.023810 | 0.074095 | inf |
| 27 | 27.0 | 10 | 0.300000 | 0.000036 | 3.0 | 7.0 | 0.000051 | 0.000032 | 0.943965 | 0.176190 | 0.521747 | inf |
| 28 | 28.0 | 9 | 0.222222 | 0.000033 | 2.0 | 7.0 | 0.000034 | 0.000032 | 0.716262 | 0.077778 | 0.227703 | inf |
| 29 | 29.0 | 11 | 0.636364 | 0.000040 | 7.0 | 4.0 | 0.000119 | 0.000019 | 2.003026 | 0.414141 | 1.286764 | inf |
| 30 | 30.0 | 7 | 0.142857 | 0.000026 | 1.0 | 6.0 | 0.000017 | 0.000028 | 0.476616 | 0.493506 | 1.526410 | inf |
| 31 | 31.0 | 8 | 0.500000 | 0.000029 | 4.0 | 4.0 | 0.000068 | 0.000019 | 1.539806 | 0.357143 | 1.063190 | inf |
| 32 | 32.0 | 6 | 0.166667 | 0.000022 | 1.0 | 5.0 | 0.000017 | 0.000023 | 0.549702 | 0.333333 | 0.990104 | inf |
| 33 | 33.0 | 3 | 0.000000 | 0.000011 | 0.0 | 3.0 | 0.000000 | 0.000014 | 0.000000 | 0.166667 | 0.549702 | inf |
| 34 | 34.0 | 1 | 1.000000 | 0.000004 | 1.0 | 0.0 | 0.000017 | 0.000000 | inf | 1.000000 | inf | inf |
| 35 | 35.0 | 3 | 0.333333 | 0.000011 | 1.0 | 2.0 | 0.000017 | 0.000009 | 1.040928 | 0.666667 | inf | inf |
| 36 | 36.0 | 1 | 0.000000 | 0.000004 | 0.0 | 1.0 | 0.000000 | 0.000005 | 0.000000 | 0.333333 | 1.040928 | inf |
| 37 | 38.0 | 1 | 0.000000 | 0.000004 | 0.0 | 1.0 | 0.000000 | 0.000005 | 0.000000 | 0.000000 | 0.000000 | inf |
| 38 | 39.0 | 1 | 1.000000 | 0.000004 | 1.0 | 0.0 | 0.000017 | 0.000000 | inf | 1.000000 | inf | inf |
| 39 | 40.0 | 1 | 0.000000 | 0.000004 | 0.0 | 1.0 | 0.000000 | 0.000005 | 0.000000 | 1.000000 | inf | inf |
| 40 | 41.0 | 1 | 1.000000 | 0.000004 | 1.0 | 0.0 | 0.000017 | 0.000000 | inf | 1.000000 | inf | inf |
| 41 | 42.0 | 1 | 0.000000 | 0.000004 | 0.0 | 1.0 | 0.000000 | 0.000005 | 0.000000 | 1.000000 | inf | inf |
| 42 | 46.0 | 1 | 1.000000 | 0.000004 | 1.0 | 0.0 | 0.000017 | 0.000000 | inf | 1.000000 | inf | inf |
plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
# We create the following categories: '0-3', '4-7', '8-13', '14-21', '>=22'
# '>=22' will be the reference category
df_inputs_prepr['acc_open_past_24mths:0-3'] = np.where(df_inputs_prepr['acc_open_past_24mths'].isin(range(0, 4)), 1, 0)
df_inputs_prepr['acc_open_past_24mths:4-7'] = np.where(df_inputs_prepr['acc_open_past_24mths'].isin(range(4, 8)), 1, 0)
df_inputs_prepr['acc_open_past_24mths:8-13'] = np.where(df_inputs_prepr['acc_open_past_24mths'].isin(range(8, 14)), 1, 0)
df_inputs_prepr['acc_open_past_24mths:14-21'] = np.where(df_inputs_prepr['acc_open_past_24mths'].isin(range(14, 22)), 1, 0)
df_inputs_prepr['acc_open_past_24mths:>=22'] = np.where(df_inputs_prepr['acc_open_past_24mths'].isin(range(22, 500)), 1, 0)
C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1260450660.py:3: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()` df_inputs_prepr['acc_open_past_24mths:0-3'] = np.where(df_inputs_prepr['acc_open_past_24mths'].isin(range(0, 4)), 1, 0) C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1260450660.py:4: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()` df_inputs_prepr['acc_open_past_24mths:4-7'] = np.where(df_inputs_prepr['acc_open_past_24mths'].isin(range(4, 8)), 1, 0) C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1260450660.py:5: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()` df_inputs_prepr['acc_open_past_24mths:8-13'] = np.where(df_inputs_prepr['acc_open_past_24mths'].isin(range(8, 14)), 1, 0) C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1260450660.py:6: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()` df_inputs_prepr['acc_open_past_24mths:14-21'] = np.where(df_inputs_prepr['acc_open_past_24mths'].isin(range(14, 22)), 1, 0) C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1260450660.py:7: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()` df_inputs_prepr['acc_open_past_24mths:>=22'] = np.where(df_inputs_prepr['acc_open_past_24mths'].isin(range(22, 500)), 1, 0)
Variable: 'total_cu_tl'¶
# unique values
df_inputs_prepr['total_cu_tl'].unique()
array([ 0., 2., 1., 8., 4., 3., 5., 6., 10., 11., 22., 9., 7.,
12., 17., 13., 15., 24., 14., 19., 16., 31., 23., 20., 21., 18.,
27., 28., 33., 26., 38., 34., 25., 29., 48., 37., 32., 43., 30.,
40., 41.])
# 'total_cu_tl'
df_temp = woe_ordered_continuous(df_inputs_prepr, 'total_cu_tl', df_targets_prepr)
# We calculate weight of evidence.
df_temp
| total_cu_tl | n_obs | prop_good | prop_n_obs | n_good | n_bad | prop_n_good | prop_n_bad | WoE | diff_prop_good | diff_WoE | IV | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 220366 | 0.206797 | 0.803569 | 45571.0 | 174795.0 | 0.774991 | 0.811370 | 0.670474 | NaN | NaN | inf |
| 1 | 1.0 | 18961 | 0.251516 | 0.069142 | 4769.0 | 14192.0 | 0.081103 | 0.065877 | 0.802506 | 0.044719 | 0.132032 | inf |
| 2 | 2.0 | 10641 | 0.247251 | 0.038803 | 2631.0 | 8010.0 | 0.044743 | 0.037181 | 0.789997 | 0.004265 | 0.012508 | inf |
| 3 | 3.0 | 6975 | 0.245591 | 0.025434 | 1713.0 | 5262.0 | 0.029132 | 0.024425 | 0.785125 | 0.001660 | 0.004872 | inf |
| 4 | 4.0 | 4696 | 0.244676 | 0.017124 | 1149.0 | 3547.0 | 0.019540 | 0.016465 | 0.782439 | 0.000915 | 0.002687 | inf |
| 5 | 5.0 | 3361 | 0.238322 | 0.012256 | 801.0 | 2560.0 | 0.013622 | 0.011883 | 0.763761 | 0.006354 | 0.018678 | inf |
| 6 | 6.0 | 2405 | 0.241996 | 0.008770 | 582.0 | 1823.0 | 0.009898 | 0.008462 | 0.774564 | 0.003674 | 0.010803 | inf |
| 7 | 7.0 | 1715 | 0.233819 | 0.006254 | 401.0 | 1314.0 | 0.006819 | 0.006099 | 0.750503 | 0.008177 | 0.024061 | inf |
| 8 | 8.0 | 1313 | 0.221630 | 0.004788 | 291.0 | 1022.0 | 0.004949 | 0.004744 | 0.714509 | 0.012189 | 0.035994 | inf |
| 9 | 9.0 | 941 | 0.224230 | 0.003431 | 211.0 | 730.0 | 0.003588 | 0.003389 | 0.722199 | 0.002600 | 0.007690 | inf |
| 10 | 10.0 | 683 | 0.248902 | 0.002491 | 170.0 | 513.0 | 0.002891 | 0.002381 | 0.794840 | 0.024672 | 0.072641 | inf |
| 11 | 11.0 | 520 | 0.240385 | 0.001896 | 125.0 | 395.0 | 0.002126 | 0.001834 | 0.769828 | 0.008517 | 0.025012 | inf |
| 12 | 12.0 | 392 | 0.252551 | 0.001429 | 99.0 | 293.0 | 0.001684 | 0.001360 | 0.805538 | 0.012166 | 0.035710 | inf |
| 13 | 13.0 | 284 | 0.193662 | 0.001036 | 55.0 | 229.0 | 0.000935 | 0.001063 | 0.631232 | 0.058889 | 0.174307 | inf |
| 14 | 14.0 | 225 | 0.213333 | 0.000820 | 48.0 | 177.0 | 0.000816 | 0.000822 | 0.689913 | 0.019671 | 0.058681 | inf |
| 15 | 15.0 | 164 | 0.195122 | 0.000598 | 32.0 | 132.0 | 0.000544 | 0.000613 | 0.635606 | 0.018211 | 0.054307 | inf |
| 16 | 16.0 | 148 | 0.283784 | 0.000540 | 42.0 | 106.0 | 0.000714 | 0.000492 | 0.896761 | 0.088662 | 0.261155 | inf |
| 17 | 17.0 | 81 | 0.271605 | 0.000295 | 22.0 | 59.0 | 0.000374 | 0.000274 | 0.861251 | 0.012179 | 0.035509 | inf |
| 18 | 18.0 | 73 | 0.164384 | 0.000266 | 12.0 | 61.0 | 0.000204 | 0.000283 | 0.542746 | 0.107221 | 0.318506 | inf |
| 19 | 19.0 | 56 | 0.303571 | 0.000204 | 17.0 | 39.0 | 0.000289 | 0.000181 | 0.954353 | 0.139188 | 0.411608 | inf |
| 20 | 20.0 | 53 | 0.188679 | 0.000193 | 10.0 | 43.0 | 0.000170 | 0.000200 | 0.616277 | 0.114892 | 0.338077 | inf |
| 21 | 21.0 | 40 | 0.225000 | 0.000146 | 9.0 | 31.0 | 0.000153 | 0.000144 | 0.724476 | 0.036321 | 0.108200 | inf |
| 22 | 22.0 | 19 | 0.315789 | 0.000069 | 6.0 | 13.0 | 0.000102 | 0.000060 | 0.989887 | 0.090789 | 0.265411 | inf |
| 23 | 23.0 | 26 | 0.192308 | 0.000095 | 5.0 | 21.0 | 0.000085 | 0.000097 | 0.627171 | 0.123482 | 0.362717 | inf |
| 24 | 24.0 | 21 | 0.619048 | 0.000077 | 13.0 | 8.0 | 0.000221 | 0.000037 | 1.939243 | 0.426740 | 1.312073 | inf |
| 25 | 25.0 | 16 | 0.187500 | 0.000058 | 3.0 | 13.0 | 0.000051 | 0.000060 | 0.612732 | 0.431548 | 1.326512 | inf |
| 26 | 26.0 | 11 | 0.363636 | 0.000040 | 4.0 | 7.0 | 0.000068 | 0.000032 | 1.129314 | 0.176136 | 0.516583 | inf |
| 27 | 27.0 | 9 | 0.222222 | 0.000033 | 2.0 | 7.0 | 0.000034 | 0.000032 | 0.716262 | 0.141414 | 0.413053 | inf |
| 28 | 28.0 | 8 | 0.125000 | 0.000029 | 1.0 | 7.0 | 0.000017 | 0.000032 | 0.420934 | 0.097222 | 0.295328 | inf |
| 29 | 29.0 | 4 | 0.250000 | 0.000015 | 1.0 | 3.0 | 0.000017 | 0.000014 | 0.798060 | 0.125000 | 0.377126 | inf |
| 30 | 30.0 | 2 | 0.000000 | 0.000007 | 0.0 | 2.0 | 0.000000 | 0.000009 | 0.000000 | 0.250000 | 0.798060 | inf |
| 31 | 31.0 | 6 | 0.333333 | 0.000022 | 2.0 | 4.0 | 0.000034 | 0.000019 | 1.040928 | 0.333333 | 1.040928 | inf |
| 32 | 32.0 | 5 | 0.200000 | 0.000018 | 1.0 | 4.0 | 0.000017 | 0.000019 | 0.650199 | 0.133333 | 0.390729 | inf |
| 33 | 33.0 | 4 | 0.250000 | 0.000015 | 1.0 | 3.0 | 0.000017 | 0.000014 | 0.798060 | 0.050000 | 0.147862 | inf |
| 34 | 34.0 | 2 | 0.000000 | 0.000007 | 0.0 | 2.0 | 0.000000 | 0.000009 | 0.000000 | 0.250000 | 0.798060 | inf |
| 35 | 37.0 | 1 | 0.000000 | 0.000004 | 0.0 | 1.0 | 0.000000 | 0.000005 | 0.000000 | 0.000000 | 0.000000 | inf |
| 36 | 38.0 | 2 | 0.500000 | 0.000007 | 1.0 | 1.0 | 0.000017 | 0.000005 | 1.539806 | 0.500000 | 1.539806 | inf |
| 37 | 40.0 | 1 | 0.000000 | 0.000004 | 0.0 | 1.0 | 0.000000 | 0.000005 | 0.000000 | 0.500000 | 1.539806 | inf |
| 38 | 41.0 | 1 | 1.000000 | 0.000004 | 1.0 | 0.0 | 0.000017 | 0.000000 | inf | 1.000000 | inf | inf |
| 39 | 43.0 | 1 | 0.000000 | 0.000004 | 0.0 | 1.0 | 0.000000 | 0.000005 | 0.000000 | 1.000000 | inf | inf |
| 40 | 48.0 | 2 | 0.500000 | 0.000007 | 1.0 | 1.0 | 0.000017 | 0.000005 | 1.539806 | 0.500000 | 1.539806 | inf |
plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
# We create the following categories: '0', '1-7', '8-17', '>=18'
# '>=18' will be the reference category
df_inputs_prepr['total_cu_tl:0'] = np.where(df_inputs_prepr['total_cu_tl'].isin([0]), 1, 0)
df_inputs_prepr['total_cu_tl:1-7'] = np.where(df_inputs_prepr['total_cu_tl'].isin(range(1, 8)), 1, 0)
df_inputs_prepr['total_cu_tl:8-17'] = np.where(df_inputs_prepr['total_cu_tl'].isin(range(8, 18)), 1, 0)
df_inputs_prepr['total_cu_tl:>=18'] = np.where(df_inputs_prepr['total_cu_tl'].isin(range(18, 500)), 1, 0)
C:\Users\pc\AppData\Local\Temp\ipykernel_6728\3794626681.py:3: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()` df_inputs_prepr['total_cu_tl:0'] = np.where(df_inputs_prepr['total_cu_tl'].isin([0]), 1, 0) C:\Users\pc\AppData\Local\Temp\ipykernel_6728\3794626681.py:4: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()` df_inputs_prepr['total_cu_tl:1-7'] = np.where(df_inputs_prepr['total_cu_tl'].isin(range(1, 8)), 1, 0) C:\Users\pc\AppData\Local\Temp\ipykernel_6728\3794626681.py:5: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()` df_inputs_prepr['total_cu_tl:8-17'] = np.where(df_inputs_prepr['total_cu_tl'].isin(range(8, 18)), 1, 0) C:\Users\pc\AppData\Local\Temp\ipykernel_6728\3794626681.py:6: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()` df_inputs_prepr['total_cu_tl:>=18'] = np.where(df_inputs_prepr['total_cu_tl'].isin(range(18, 500)), 1, 0)
Variable: 'inq_last_12m'¶
# unique values
df_inputs_prepr['inq_last_12m'].unique()
array([ 0., 4., 2., 3., 1., 5., 7., 12., 6., 14., 11., 18., 8.,
20., 10., 15., 9., 16., 28., 13., 21., 17., 22., 19., 32., 26.,
23., 25., 29., 24., 33., 34., 31., 30., 27., 40.])
# 'inq_last_12m'
df_temp = woe_ordered_continuous(df_inputs_prepr, 'inq_last_12m', df_targets_prepr)
# We calculate weight of evidence.
df_temp
| inq_last_12m | n_obs | prop_good | prop_n_obs | n_good | n_bad | prop_n_good | prop_n_bad | WoE | diff_prop_good | diff_WoE | IV | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 188518 | 0.190889 | 0.687435 | 35986.0 | 152532.0 | 0.611986 | 0.708029 | 0.622914 | NaN | NaN | inf |
| 1 | 1.0 | 26092 | 0.245746 | 0.095145 | 6412.0 | 19680.0 | 0.109044 | 0.091351 | 0.785579 | 0.054857 | 0.162665 | inf |
| 2 | 2.0 | 19977 | 0.256145 | 0.072847 | 5117.0 | 14860.0 | 0.087021 | 0.068978 | 0.816064 | 0.010399 | 0.030485 | inf |
| 3 | 3.0 | 13606 | 0.273629 | 0.049615 | 3723.0 | 9883.0 | 0.063314 | 0.045875 | 0.867158 | 0.017485 | 0.051095 | inf |
| 4 | 4.0 | 9261 | 0.276212 | 0.033770 | 2558.0 | 6703.0 | 0.043502 | 0.031114 | 0.874692 | 0.002583 | 0.007534 | inf |
| 5 | 5.0 | 5760 | 0.284028 | 0.021004 | 1636.0 | 4124.0 | 0.027822 | 0.019143 | 0.897472 | 0.007816 | 0.022780 | inf |
| 6 | 6.0 | 3730 | 0.285255 | 0.013602 | 1064.0 | 2666.0 | 0.018095 | 0.012375 | 0.901045 | 0.001227 | 0.003574 | inf |
| 7 | 7.0 | 2383 | 0.300042 | 0.008690 | 715.0 | 1668.0 | 0.012159 | 0.007743 | 0.944087 | 0.014787 | 0.043041 | inf |
| 8 | 8.0 | 1593 | 0.321406 | 0.005809 | 512.0 | 1081.0 | 0.008707 | 0.005018 | 1.006223 | 0.021364 | 0.062137 | inf |
| 9 | 9.0 | 1024 | 0.303711 | 0.003734 | 311.0 | 713.0 | 0.005289 | 0.003310 | 0.954759 | 0.017695 | 0.051464 | inf |
| 10 | 10.0 | 653 | 0.320061 | 0.002381 | 209.0 | 444.0 | 0.003554 | 0.002061 | 1.002311 | 0.016350 | 0.047552 | inf |
| 11 | 11.0 | 495 | 0.339394 | 0.001805 | 168.0 | 327.0 | 0.002857 | 0.001518 | 1.058575 | 0.019333 | 0.056263 | inf |
| 12 | 12.0 | 293 | 0.300341 | 0.001068 | 88.0 | 205.0 | 0.001497 | 0.000952 | 0.944957 | 0.039053 | 0.113617 | inf |
| 13 | 13.0 | 224 | 0.330357 | 0.000817 | 74.0 | 150.0 | 0.001258 | 0.000696 | 1.032265 | 0.030016 | 0.087308 | inf |
| 14 | 14.0 | 186 | 0.413978 | 0.000678 | 77.0 | 109.0 | 0.001309 | 0.000506 | 1.277625 | 0.083621 | 0.245360 | inf |
| 15 | 15.0 | 104 | 0.384615 | 0.000379 | 40.0 | 64.0 | 0.000680 | 0.000297 | 1.190828 | 0.029363 | 0.086797 | inf |
| 16 | 16.0 | 98 | 0.367347 | 0.000357 | 36.0 | 62.0 | 0.000612 | 0.000288 | 1.140170 | 0.017268 | 0.050657 | inf |
| 17 | 17.0 | 58 | 0.293103 | 0.000211 | 17.0 | 41.0 | 0.000289 | 0.000190 | 0.923897 | 0.074243 | 0.216273 | inf |
| 18 | 18.0 | 42 | 0.309524 | 0.000153 | 13.0 | 29.0 | 0.000221 | 0.000135 | 0.971665 | 0.016420 | 0.047768 | inf |
| 19 | 19.0 | 40 | 0.250000 | 0.000146 | 10.0 | 30.0 | 0.000170 | 0.000139 | 0.798060 | 0.059524 | 0.173605 | inf |
| 20 | 20.0 | 21 | 0.285714 | 0.000077 | 6.0 | 15.0 | 0.000102 | 0.000070 | 0.902384 | 0.035714 | 0.104324 | inf |
| 21 | 21.0 | 16 | 0.562500 | 0.000058 | 9.0 | 7.0 | 0.000153 | 0.000032 | 1.742298 | 0.276786 | 0.839914 | inf |
| 22 | 22.0 | 14 | 0.500000 | 0.000051 | 7.0 | 7.0 | 0.000119 | 0.000032 | 1.539806 | 0.062500 | 0.202492 | inf |
| 23 | 23.0 | 13 | 0.307692 | 0.000047 | 4.0 | 9.0 | 0.000068 | 0.000042 | 0.966339 | 0.192308 | 0.573467 | inf |
| 24 | 24.0 | 7 | 0.428571 | 0.000026 | 3.0 | 4.0 | 0.000051 | 0.000019 | 1.321159 | 0.120879 | 0.354820 | inf |
| 25 | 25.0 | 7 | 0.285714 | 0.000026 | 2.0 | 5.0 | 0.000034 | 0.000023 | 0.902384 | 0.142857 | 0.418775 | inf |
| 26 | 26.0 | 5 | 0.200000 | 0.000018 | 1.0 | 4.0 | 0.000017 | 0.000019 | 0.650199 | 0.085714 | 0.252186 | inf |
| 27 | 27.0 | 1 | 0.000000 | 0.000004 | 0.0 | 1.0 | 0.000000 | 0.000005 | 0.000000 | 0.200000 | 0.650199 | inf |
| 28 | 28.0 | 1 | 1.000000 | 0.000004 | 1.0 | 0.0 | 0.000017 | 0.000000 | inf | 1.000000 | inf | inf |
| 29 | 29.0 | 4 | 0.250000 | 0.000015 | 1.0 | 3.0 | 0.000017 | 0.000014 | 0.798060 | 0.750000 | inf | inf |
| 30 | 30.0 | 2 | 0.000000 | 0.000007 | 0.0 | 2.0 | 0.000000 | 0.000009 | 0.000000 | 0.250000 | 0.798060 | inf |
| 31 | 31.0 | 1 | 1.000000 | 0.000004 | 1.0 | 0.0 | 0.000017 | 0.000000 | inf | 1.000000 | inf | inf |
| 32 | 32.0 | 1 | 0.000000 | 0.000004 | 0.0 | 1.0 | 0.000000 | 0.000005 | 0.000000 | 1.000000 | inf | inf |
| 33 | 33.0 | 1 | 0.000000 | 0.000004 | 0.0 | 1.0 | 0.000000 | 0.000005 | 0.000000 | 0.000000 | 0.000000 | inf |
| 34 | 34.0 | 2 | 0.000000 | 0.000007 | 0.0 | 2.0 | 0.000000 | 0.000009 | 0.000000 | 0.000000 | 0.000000 | inf |
| 35 | 40.0 | 1 | 1.000000 | 0.000004 | 1.0 | 0.0 | 0.000017 | 0.000000 | inf | 1.000000 | inf | inf |
plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
# We create the following categories: '0', '1-4', '5-9', '10-16', '>=17'
# '>=17' will be the reference category
df_inputs_prepr['inq_last_12m:0'] = np.where(df_inputs_prepr['inq_last_12m'].isin([0]), 1, 0)
df_inputs_prepr['inq_last_12m:1-4'] = np.where(df_inputs_prepr['inq_last_12m'].isin(range(1, 5)), 1, 0)
df_inputs_prepr['inq_last_12m:5-9'] = np.where(df_inputs_prepr['inq_last_12m'].isin(range(5, 10)), 1, 0)
df_inputs_prepr['inq_last_12m:10-16'] = np.where(df_inputs_prepr['inq_last_12m'].isin(range(10, 17)), 1, 0)
df_inputs_prepr['inq_last_12m:>=17'] = np.where(df_inputs_prepr['inq_last_12m'].isin(range(17, 500)), 1, 0)
C:\Users\pc\AppData\Local\Temp\ipykernel_6728\2586074654.py:3: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()` df_inputs_prepr['inq_last_12m:0'] = np.where(df_inputs_prepr['inq_last_12m'].isin([0]), 1, 0) C:\Users\pc\AppData\Local\Temp\ipykernel_6728\2586074654.py:4: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()` df_inputs_prepr['inq_last_12m:1-4'] = np.where(df_inputs_prepr['inq_last_12m'].isin(range(1, 5)), 1, 0) C:\Users\pc\AppData\Local\Temp\ipykernel_6728\2586074654.py:5: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()` df_inputs_prepr['inq_last_12m:5-9'] = np.where(df_inputs_prepr['inq_last_12m'].isin(range(5, 10)), 1, 0) C:\Users\pc\AppData\Local\Temp\ipykernel_6728\2586074654.py:6: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()` df_inputs_prepr['inq_last_12m:10-16'] = np.where(df_inputs_prepr['inq_last_12m'].isin(range(10, 17)), 1, 0) C:\Users\pc\AppData\Local\Temp\ipykernel_6728\2586074654.py:7: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()` df_inputs_prepr['inq_last_12m:>=17'] = np.where(df_inputs_prepr['inq_last_12m'].isin(range(17, 500)), 1, 0)
Variable: 'mths_since_recent_inq'¶
# unique values
df_inputs_prepr['mths_since_recent_inq'].unique()
array([ 5., 3., 21., 0., 1., 19., 2., 11., 4., 999., 12.,
7., 9., 10., 15., 14., 20., 8., 6., 18., 22., 13.,
16., 23., 17., 24., 25.])
# 'mths_since_recent_inq'
df_temp = woe_ordered_continuous(df_inputs_prepr, 'mths_since_recent_inq', df_targets_prepr)
# We calculate weight of evidence.
df_temp
| mths_since_recent_inq | n_obs | prop_good | prop_n_obs | n_good | n_bad | prop_n_good | prop_n_bad | WoE | diff_prop_good | diff_WoE | IV | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 22048 | 0.267371 | 0.080398 | 5895.0 | 16153.0 | 0.100252 | 0.074980 | 0.848891 | NaN | NaN | 0.017868 |
| 1 | 1.0 | 27914 | 0.258472 | 0.101789 | 7215.0 | 20699.0 | 0.122700 | 0.096081 | 0.822877 | 0.008899 | 0.026014 | 0.017868 |
| 2 | 2.0 | 22108 | 0.240773 | 0.080617 | 5323.0 | 16785.0 | 0.090524 | 0.077913 | 0.770968 | 0.017700 | 0.051909 | 0.017868 |
| 3 | 3.0 | 19797 | 0.228671 | 0.072190 | 4527.0 | 15270.0 | 0.076987 | 0.070881 | 0.735320 | 0.012102 | 0.035648 | 0.017868 |
| 4 | 4.0 | 17738 | 0.221896 | 0.064682 | 3936.0 | 13802.0 | 0.066936 | 0.064067 | 0.715298 | 0.006775 | 0.020022 | 0.017868 |
| 5 | 5.0 | 15701 | 0.209031 | 0.057254 | 3282.0 | 12419.0 | 0.055814 | 0.057647 | 0.677125 | 0.012865 | 0.038173 | 0.017868 |
| 6 | 6.0 | 13973 | 0.212911 | 0.050953 | 2975.0 | 10998.0 | 0.050594 | 0.051051 | 0.688657 | 0.003879 | 0.011532 | 0.017868 |
| 7 | 7.0 | 13218 | 0.216069 | 0.048200 | 2856.0 | 10362.0 | 0.048570 | 0.048099 | 0.698032 | 0.003158 | 0.009375 | 0.017868 |
| 8 | 8.0 | 11630 | 0.216853 | 0.042409 | 2522.0 | 9108.0 | 0.042890 | 0.042278 | 0.700357 | 0.000784 | 0.002325 | 0.017868 |
| 9 | 9.0 | 10086 | 0.208903 | 0.036779 | 2107.0 | 7979.0 | 0.035832 | 0.037037 | 0.676745 | 0.007950 | 0.023613 | 0.017868 |
| 10 | 10.0 | 8841 | 0.199638 | 0.032239 | 1765.0 | 7076.0 | 0.030016 | 0.032846 | 0.649117 | 0.009265 | 0.027628 | 0.017868 |
| 11 | 11.0 | 7851 | 0.193861 | 0.028629 | 1522.0 | 6329.0 | 0.025883 | 0.029378 | 0.631827 | 0.005777 | 0.017290 | 0.017868 |
| 12 | 12.0 | 6969 | 0.199598 | 0.025413 | 1391.0 | 5578.0 | 0.023656 | 0.025892 | 0.648998 | 0.005738 | 0.017171 | 0.017868 |
| 13 | 13.0 | 6299 | 0.193364 | 0.022969 | 1218.0 | 5081.0 | 0.020714 | 0.023585 | 0.630338 | 0.006234 | 0.018660 | 0.017868 |
| 14 | 14.0 | 5481 | 0.201423 | 0.019987 | 1104.0 | 4377.0 | 0.018775 | 0.020317 | 0.654449 | 0.008059 | 0.024111 | 0.017868 |
| 15 | 15.0 | 4819 | 0.202739 | 0.017573 | 977.0 | 3842.0 | 0.016615 | 0.017834 | 0.658377 | 0.001316 | 0.003928 | 0.017868 |
| 16 | 16.0 | 4109 | 0.208080 | 0.014984 | 855.0 | 3254.0 | 0.014540 | 0.015105 | 0.674294 | 0.005341 | 0.015916 | 0.017868 |
| 17 | 17.0 | 3661 | 0.195575 | 0.013350 | 716.0 | 2945.0 | 0.012176 | 0.013670 | 0.636963 | 0.012505 | 0.037331 | 0.017868 |
| 18 | 18.0 | 3349 | 0.180950 | 0.012212 | 606.0 | 2743.0 | 0.010306 | 0.012733 | 0.592997 | 0.014625 | 0.043966 | 0.017868 |
| 19 | 19.0 | 3052 | 0.186435 | 0.011129 | 569.0 | 2483.0 | 0.009677 | 0.011526 | 0.609528 | 0.005486 | 0.016531 | 0.017868 |
| 20 | 20.0 | 2628 | 0.177702 | 0.009583 | 467.0 | 2161.0 | 0.007942 | 0.010031 | 0.583185 | 0.008733 | 0.026344 | 0.017868 |
| 21 | 21.0 | 2406 | 0.174979 | 0.008774 | 421.0 | 1985.0 | 0.007160 | 0.009214 | 0.574945 | 0.002722 | 0.008239 | 0.017868 |
| 22 | 22.0 | 2187 | 0.172840 | 0.007975 | 378.0 | 1809.0 | 0.006428 | 0.008397 | 0.568460 | 0.002140 | 0.006485 | 0.017868 |
| 23 | 23.0 | 2086 | 0.170662 | 0.007607 | 356.0 | 1730.0 | 0.006054 | 0.008030 | 0.561850 | 0.002178 | 0.006610 | 0.017868 |
| 24 | 24.0 | 975 | 0.177436 | 0.003555 | 173.0 | 802.0 | 0.002942 | 0.003723 | 0.582381 | 0.006774 | 0.020531 | 0.017868 |
| 25 | 25.0 | 3 | 0.000000 | 0.000011 | 0.0 | 3.0 | 0.000000 | 0.000014 | 0.000000 | 0.177436 | 0.582381 | 0.017868 |
| 26 | 999.0 | 35305 | 0.159921 | 0.128740 | 5646.0 | 29659.0 | 0.096017 | 0.137672 | 0.529117 | 0.159921 | 0.529117 | 0.017868 |
plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
# We create the following categories: '0-1', '2-3', '4-6', '7-10', '11-15', '>=16, 'Missing' = 999.
# '>=17' will be the reference category
df_inputs_prepr['mths_since_recent_inq:Missing'] = np.where(df_inputs_prepr['mths_since_recent_inq'].isin([999]), 1, 0)
df_inputs_prepr['mths_since_recent_inq:0-1'] = np.where(df_inputs_prepr['mths_since_recent_inq'].isin(range(0, 2)), 1, 0)
df_inputs_prepr['mths_since_recent_inq:2-3'] = np.where(df_inputs_prepr['mths_since_recent_inq'].isin(range(2, 4)), 1, 0)
df_inputs_prepr['mths_since_recent_inq:4-6'] = np.where(df_inputs_prepr['mths_since_recent_inq'].isin(range(4, 7)), 1, 0)
df_inputs_prepr['mths_since_recent_inq:7-10'] = np.where(df_inputs_prepr['mths_since_recent_inq'].isin(range(7, 11)), 1, 0)
df_inputs_prepr['mths_since_recent_inq:11-15'] = np.where(df_inputs_prepr['mths_since_recent_inq'].isin(range(11, 16)), 1, 0)
df_inputs_prepr['mths_since_recent_inq:>=16'] = np.where(df_inputs_prepr['mths_since_recent_inq'].isin(range(16, 500)), 1, 0)
Variable: 'out_prncp'¶
# unique values
df_inputs_prepr['out_prncp'].unique()
array([ 0. , 17331.34, 357.91, ..., 13382.28, 24725.78, 29443.26])
df_inputs_prepr.loc[df_inputs_prepr['out_prncp'] == 0, : ]['out_prncp'].count()
269105
# one other category will be created for 'out_prncp' value = 0.
#********************************
# 'out_prncp'
# the categories of everyone with 'out_prncp' less or equal 100.
df_inputs_prepr_temp = df_inputs_prepr.loc[df_inputs_prepr['out_prncp'] != 0, : ]
# Here we do fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.
#df_inputs_prepr_temp
df_inputs_prepr_temp['out_prncp_factor'] = pd.cut(df_inputs_prepr_temp['out_prncp'], 50)
# Here we do fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.
df_temp = woe_ordered_continuous(df_inputs_prepr_temp, 'out_prncp_factor', df_targets_prepr[df_inputs_prepr_temp.index])
# We calculate weight of evidence.
df_temp
C:\Users\pc\AppData\Local\Temp\ipykernel_6728\469751821.py:9: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()` df_inputs_prepr_temp['out_prncp_factor'] = pd.cut(df_inputs_prepr_temp['out_prncp'], 50) C:\Users\pc\AppData\Local\Temp\ipykernel_6728\469751821.py:9: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy df_inputs_prepr_temp['out_prncp_factor'] = pd.cut(df_inputs_prepr_temp['out_prncp'], 50) C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1025164655.py:4: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning. df = pd.concat([df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].count(), C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1025164655.py:5: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning. df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].mean()], axis = 1)
| out_prncp_factor | n_obs | prop_good | prop_n_obs | n_good | n_bad | prop_n_good | prop_n_bad | WoE | diff_prop_good | diff_WoE | IV | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | (-29.798, 791.883] | 131 | 1.0 | 0.025541 | 131.0 | 0.0 | 0.025541 | NaN | NaN | NaN | NaN | 0.0 |
| 1 | (791.883, 1574.435] | 223 | 1.0 | 0.043478 | 223.0 | 0.0 | 0.043478 | NaN | NaN | 0.0 | NaN | 0.0 |
| 2 | (1574.435, 2356.988] | 251 | 1.0 | 0.048937 | 251.0 | 0.0 | 0.048937 | NaN | NaN | 0.0 | NaN | 0.0 |
| 3 | (2356.988, 3139.54] | 218 | 1.0 | 0.042503 | 218.0 | 0.0 | 0.042503 | NaN | NaN | 0.0 | NaN | 0.0 |
| 4 | (3139.54, 3922.093] | 226 | 1.0 | 0.044063 | 226.0 | 0.0 | 0.044063 | NaN | NaN | 0.0 | NaN | 0.0 |
| 5 | (3922.093, 4704.646] | 266 | 1.0 | 0.051862 | 266.0 | 0.0 | 0.051862 | NaN | NaN | 0.0 | NaN | 0.0 |
| 6 | (4704.646, 5487.198] | 221 | 1.0 | 0.043088 | 221.0 | 0.0 | 0.043088 | NaN | NaN | 0.0 | NaN | 0.0 |
| 7 | (5487.198, 6269.751] | 212 | 1.0 | 0.041334 | 212.0 | 0.0 | 0.041334 | NaN | NaN | 0.0 | NaN | 0.0 |
| 8 | (6269.751, 7052.303] | 193 | 1.0 | 0.037629 | 193.0 | 0.0 | 0.037629 | NaN | NaN | 0.0 | NaN | 0.0 |
| 9 | (7052.303, 7834.856] | 191 | 1.0 | 0.037239 | 191.0 | 0.0 | 0.037239 | NaN | NaN | 0.0 | NaN | 0.0 |
| 10 | (7834.856, 8617.409] | 202 | 1.0 | 0.039384 | 202.0 | 0.0 | 0.039384 | NaN | NaN | 0.0 | NaN | 0.0 |
| 11 | (8617.409, 9399.961] | 237 | 1.0 | 0.046208 | 237.0 | 0.0 | 0.046208 | NaN | NaN | 0.0 | NaN | 0.0 |
| 12 | (9399.961, 10182.514] | 201 | 1.0 | 0.039189 | 201.0 | 0.0 | 0.039189 | NaN | NaN | 0.0 | NaN | 0.0 |
| 13 | (10182.514, 10965.066] | 154 | 1.0 | 0.030025 | 154.0 | 0.0 | 0.030025 | NaN | NaN | 0.0 | NaN | 0.0 |
| 14 | (10965.066, 11747.619] | 159 | 1.0 | 0.031000 | 159.0 | 0.0 | 0.031000 | NaN | NaN | 0.0 | NaN | 0.0 |
| 15 | (11747.619, 12530.172] | 161 | 1.0 | 0.031390 | 161.0 | 0.0 | 0.031390 | NaN | NaN | 0.0 | NaN | 0.0 |
| 16 | (12530.172, 13312.724] | 131 | 1.0 | 0.025541 | 131.0 | 0.0 | 0.025541 | NaN | NaN | 0.0 | NaN | 0.0 |
| 17 | (13312.724, 14095.277] | 157 | 1.0 | 0.030610 | 157.0 | 0.0 | 0.030610 | NaN | NaN | 0.0 | NaN | 0.0 |
| 18 | (14095.277, 14877.829] | 108 | 1.0 | 0.021057 | 108.0 | 0.0 | 0.021057 | NaN | NaN | 0.0 | NaN | 0.0 |
| 19 | (14877.829, 15660.382] | 104 | 1.0 | 0.020277 | 104.0 | 0.0 | 0.020277 | NaN | NaN | 0.0 | NaN | 0.0 |
| 20 | (15660.382, 16442.935] | 99 | 1.0 | 0.019302 | 99.0 | 0.0 | 0.019302 | NaN | NaN | 0.0 | NaN | 0.0 |
| 21 | (16442.935, 17225.487] | 109 | 1.0 | 0.021252 | 109.0 | 0.0 | 0.021252 | NaN | NaN | 0.0 | NaN | 0.0 |
| 22 | (17225.487, 18008.04] | 97 | 1.0 | 0.018912 | 97.0 | 0.0 | 0.018912 | NaN | NaN | 0.0 | NaN | 0.0 |
| 23 | (18008.04, 18790.592] | 99 | 1.0 | 0.019302 | 99.0 | 0.0 | 0.019302 | NaN | NaN | 0.0 | NaN | 0.0 |
| 24 | (18790.592, 19573.145] | 86 | 1.0 | 0.016767 | 86.0 | 0.0 | 0.016767 | NaN | NaN | 0.0 | NaN | 0.0 |
| 25 | (19573.145, 20355.698] | 60 | 1.0 | 0.011698 | 60.0 | 0.0 | 0.011698 | NaN | NaN | 0.0 | NaN | 0.0 |
| 26 | (20355.698, 21138.25] | 70 | 1.0 | 0.013648 | 70.0 | 0.0 | 0.013648 | NaN | NaN | 0.0 | NaN | 0.0 |
| 27 | (21138.25, 21920.803] | 68 | 1.0 | 0.013258 | 68.0 | 0.0 | 0.013258 | NaN | NaN | 0.0 | NaN | 0.0 |
| 28 | (21920.803, 22703.355] | 66 | 1.0 | 0.012868 | 66.0 | 0.0 | 0.012868 | NaN | NaN | 0.0 | NaN | 0.0 |
| 29 | (22703.355, 23485.908] | 53 | 1.0 | 0.010333 | 53.0 | 0.0 | 0.010333 | NaN | NaN | 0.0 | NaN | 0.0 |
| 30 | (23485.908, 24268.461] | 52 | 1.0 | 0.010138 | 52.0 | 0.0 | 0.010138 | NaN | NaN | 0.0 | NaN | 0.0 |
| 31 | (24268.461, 25051.013] | 59 | 1.0 | 0.011503 | 59.0 | 0.0 | 0.011503 | NaN | NaN | 0.0 | NaN | 0.0 |
| 32 | (25051.013, 25833.566] | 29 | 1.0 | 0.005654 | 29.0 | 0.0 | 0.005654 | NaN | NaN | 0.0 | NaN | 0.0 |
| 33 | (25833.566, 26616.118] | 45 | 1.0 | 0.008774 | 45.0 | 0.0 | 0.008774 | NaN | NaN | 0.0 | NaN | 0.0 |
| 34 | (26616.118, 27398.671] | 57 | 1.0 | 0.011113 | 57.0 | 0.0 | 0.011113 | NaN | NaN | 0.0 | NaN | 0.0 |
| 35 | (27398.671, 28181.224] | 42 | 1.0 | 0.008189 | 42.0 | 0.0 | 0.008189 | NaN | NaN | 0.0 | NaN | 0.0 |
| 36 | (28181.224, 28963.776] | 42 | 1.0 | 0.008189 | 42.0 | 0.0 | 0.008189 | NaN | NaN | 0.0 | NaN | 0.0 |
| 37 | (28963.776, 29746.329] | 43 | 1.0 | 0.008384 | 43.0 | 0.0 | 0.008384 | NaN | NaN | 0.0 | NaN | 0.0 |
| 38 | (29746.329, 30528.881] | 24 | 1.0 | 0.004679 | 24.0 | 0.0 | 0.004679 | NaN | NaN | 0.0 | NaN | 0.0 |
| 39 | (30528.881, 31311.434] | 28 | 1.0 | 0.005459 | 28.0 | 0.0 | 0.005459 | NaN | NaN | 0.0 | NaN | 0.0 |
| 40 | (31311.434, 32093.987] | 23 | 1.0 | 0.004484 | 23.0 | 0.0 | 0.004484 | NaN | NaN | 0.0 | NaN | 0.0 |
| 41 | (32093.987, 32876.539] | 20 | 1.0 | 0.003899 | 20.0 | 0.0 | 0.003899 | NaN | NaN | 0.0 | NaN | 0.0 |
| 42 | (32876.539, 33659.092] | 34 | 1.0 | 0.006629 | 34.0 | 0.0 | 0.006629 | NaN | NaN | 0.0 | NaN | 0.0 |
| 43 | (33659.092, 34441.644] | 23 | 1.0 | 0.004484 | 23.0 | 0.0 | 0.004484 | NaN | NaN | 0.0 | NaN | 0.0 |
| 44 | (34441.644, 35224.197] | 10 | 1.0 | 0.001950 | 10.0 | 0.0 | 0.001950 | NaN | NaN | 0.0 | NaN | 0.0 |
| 45 | (35224.197, 36006.75] | 10 | 1.0 | 0.001950 | 10.0 | 0.0 | 0.001950 | NaN | NaN | 0.0 | NaN | 0.0 |
| 46 | (36006.75, 36789.302] | 8 | 1.0 | 0.001560 | 8.0 | 0.0 | 0.001560 | NaN | NaN | 0.0 | NaN | 0.0 |
| 47 | (36789.302, 37571.855] | 14 | 1.0 | 0.002730 | 14.0 | 0.0 | 0.002730 | NaN | NaN | 0.0 | NaN | 0.0 |
| 48 | (37571.855, 38354.407] | 7 | 1.0 | 0.001365 | 7.0 | 0.0 | 0.001365 | NaN | NaN | 0.0 | NaN | 0.0 |
| 49 | (38354.407, 39136.96] | 6 | 1.0 | 0.001170 | 6.0 | 0.0 | 0.001170 | NaN | NaN | 0.0 | NaN | 0.0 |
plot_by_woe(df_temp.iloc[: 50, : ], 90)
# We plot the weight of evidence values.
# Categories: '=0', '>0'
df_inputs_prepr['out_prncp:=0'] = np.where((df_inputs_prepr['out_prncp'] == 0.), 1, 0)
df_inputs_prepr['out_prncp:>0'] = np.where((df_inputs_prepr['out_prncp'] > 0.), 1, 0)
Variable: 'last_pymnt_amnt'¶
# unique values
df_inputs_prepr['last_pymnt_amnt'].nunique()
193962
df_inputs_prepr['last_pymnt_amnt'].max()
42148.53
# one other category will be created for 'last_pymnt_amnt_factor' value > 10000.
#********************************
# 'last_pymnt_amnt_factor'
# the categories of everyone with 'last_pymnt_amnt_factor' less or equal 10000.
df_inputs_prepr_temp = df_inputs_prepr.loc[df_inputs_prepr['last_pymnt_amnt'] <= 10000., : ]
# Here we do fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.
#df_inputs_prepr_temp
df_inputs_prepr_temp['last_pymnt_amnt_factor'] = pd.cut(df_inputs_prepr_temp['last_pymnt_amnt'], 50)
# Here we do fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.
df_temp = woe_ordered_continuous(df_inputs_prepr_temp, 'last_pymnt_amnt_factor', df_targets_prepr[df_inputs_prepr_temp.index])
# We calculate weight of evidence.
df_temp
C:\Users\pc\AppData\Local\Temp\ipykernel_6728\462666328.py:9: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()` df_inputs_prepr_temp['last_pymnt_amnt_factor'] = pd.cut(df_inputs_prepr_temp['last_pymnt_amnt'], 50) C:\Users\pc\AppData\Local\Temp\ipykernel_6728\462666328.py:9: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy df_inputs_prepr_temp['last_pymnt_amnt_factor'] = pd.cut(df_inputs_prepr_temp['last_pymnt_amnt'], 50) C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1025164655.py:4: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning. df = pd.concat([df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].count(), C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1025164655.py:5: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning. df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].mean()], axis = 1)
| last_pymnt_amnt_factor | n_obs | prop_good | prop_n_obs | n_good | n_bad | prop_n_good | prop_n_bad | WoE | diff_prop_good | diff_WoE | IV | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | (-10.0, 200.0] | 30839 | 0.326859 | 0.140436 | 10080.0 | 20759.0 | 0.171572 | 0.129064 | 0.845591 | NaN | NaN | 0.705538 |
| 1 | (200.0, 400.0] | 39046 | 0.510885 | 0.177810 | 19948.0 | 19098.0 | 0.339535 | 0.118737 | 1.350552 | 0.184026 | 0.504960 | 0.705538 |
| 2 | (400.0, 600.0] | 24991 | 0.548037 | 0.113805 | 13696.0 | 11295.0 | 0.233119 | 0.070224 | 1.463178 | 0.037153 | 0.112626 | 0.705538 |
| 3 | (600.0, 800.0] | 15988 | 0.472167 | 0.072807 | 7549.0 | 8439.0 | 0.128491 | 0.052467 | 1.238079 | 0.075871 | 0.225099 | 0.705538 |
| 4 | (800.0, 1000.0] | 9179 | 0.445364 | 0.041800 | 4088.0 | 5091.0 | 0.069582 | 0.031652 | 1.162632 | 0.026802 | 0.075447 | 0.705538 |
| 5 | (1000.0, 1200.0] | 5887 | 0.316460 | 0.026809 | 1863.0 | 4024.0 | 0.031710 | 0.025018 | 0.818670 | 0.128904 | 0.343962 | 0.705538 |
| 6 | (1200.0, 1400.0] | 4087 | 0.232934 | 0.018612 | 952.0 | 3135.0 | 0.016204 | 0.019491 | 0.605056 | 0.083526 | 0.213614 | 0.705538 |
| 7 | (1400.0, 1600.0] | 3053 | 0.054045 | 0.013903 | 165.0 | 2888.0 | 0.002808 | 0.017955 | 0.145323 | 0.178888 | 0.459733 | 0.705538 |
| 8 | (1600.0, 1800.0] | 2748 | 0.035298 | 0.012514 | 97.0 | 2651.0 | 0.001651 | 0.016482 | 0.095467 | 0.018747 | 0.049856 | 0.705538 |
| 9 | (1800.0, 2000.0] | 2905 | 0.026506 | 0.013229 | 77.0 | 2828.0 | 0.001311 | 0.017582 | 0.071894 | 0.008792 | 0.023573 | 0.705538 |
| 10 | (2000.0, 2200.0] | 2630 | 0.018251 | 0.011977 | 48.0 | 2582.0 | 0.000817 | 0.016053 | 0.049642 | 0.008255 | 0.022252 | 0.705538 |
| 11 | (2200.0, 2400.0] | 2827 | 0.013442 | 0.012874 | 38.0 | 2789.0 | 0.000647 | 0.017340 | 0.036622 | 0.004809 | 0.013020 | 0.705538 |
| 12 | (2400.0, 2600.0] | 2846 | 0.008784 | 0.012960 | 25.0 | 2821.0 | 0.000426 | 0.017539 | 0.023972 | 0.004658 | 0.012650 | 0.705538 |
| 13 | (2600.0, 2800.0] | 2566 | 0.008574 | 0.011685 | 22.0 | 2544.0 | 0.000374 | 0.015817 | 0.023399 | 0.000211 | 0.000573 | 0.705538 |
| 14 | (2800.0, 3000.0] | 2587 | 0.004639 | 0.011781 | 12.0 | 2575.0 | 0.000204 | 0.016009 | 0.012678 | 0.003935 | 0.010722 | 0.705538 |
| 15 | (3000.0, 3200.0] | 2565 | 0.002729 | 0.011681 | 7.0 | 2558.0 | 0.000119 | 0.015904 | 0.007464 | 0.001910 | 0.005214 | 0.705538 |
| 16 | (3200.0, 3400.0] | 2455 | 0.001629 | 0.011180 | 4.0 | 2451.0 | 0.000068 | 0.015238 | 0.004458 | 0.001100 | 0.003006 | 0.705538 |
| 17 | (3400.0, 3600.0] | 2479 | 0.002824 | 0.011289 | 7.0 | 2472.0 | 0.000119 | 0.015369 | 0.007723 | 0.001194 | 0.003265 | 0.705538 |
| 18 | (3600.0, 3800.0] | 2603 | 0.001921 | 0.011854 | 5.0 | 2598.0 | 0.000085 | 0.016152 | 0.005255 | 0.000903 | 0.002467 | 0.705538 |
| 19 | (3800.0, 4000.0] | 2471 | 0.004856 | 0.011253 | 12.0 | 2459.0 | 0.000204 | 0.015288 | 0.013272 | 0.002935 | 0.008017 | 0.705538 |
| 20 | (4000.0, 4200.0] | 2523 | 0.002378 | 0.011489 | 6.0 | 2517.0 | 0.000102 | 0.015649 | 0.006505 | 0.002478 | 0.006767 | 0.705538 |
| 21 | (4200.0, 4400.0] | 2257 | 0.002215 | 0.010278 | 5.0 | 2252.0 | 0.000085 | 0.014001 | 0.006060 | 0.000163 | 0.000445 | 0.705538 |
| 22 | (4400.0, 4600.0] | 2299 | 0.000870 | 0.010469 | 2.0 | 2297.0 | 0.000034 | 0.014281 | 0.002381 | 0.001345 | 0.003679 | 0.705538 |
| 23 | (4600.0, 4800.0] | 2246 | 0.001781 | 0.010228 | 4.0 | 2242.0 | 0.000068 | 0.013939 | 0.004873 | 0.000911 | 0.002492 | 0.705538 |
| 24 | (4800.0, 5000.0] | 2361 | 0.001271 | 0.010752 | 3.0 | 2358.0 | 0.000051 | 0.014660 | 0.003477 | 0.000510 | 0.001395 | 0.705538 |
| 25 | (5000.0, 5200.0] | 2346 | 0.000426 | 0.010683 | 1.0 | 2345.0 | 0.000017 | 0.014579 | 0.001167 | 0.000844 | 0.002310 | 0.705538 |
| 26 | (5200.0, 5400.0] | 2027 | 0.001480 | 0.009231 | 3.0 | 2024.0 | 0.000051 | 0.012584 | 0.004050 | 0.001054 | 0.002883 | 0.705538 |
| 27 | (5400.0, 5600.0] | 2067 | 0.000484 | 0.009413 | 1.0 | 2066.0 | 0.000017 | 0.012845 | 0.001324 | 0.000996 | 0.002725 | 0.705538 |
| 28 | (5600.0, 5800.0] | 1971 | 0.000507 | 0.008976 | 1.0 | 1970.0 | 0.000017 | 0.012248 | 0.001389 | 0.000024 | 0.000064 | 0.705538 |
| 29 | (5800.0, 6000.0] | 2057 | 0.000486 | 0.009367 | 1.0 | 2056.0 | 0.000017 | 0.012783 | 0.001331 | 0.000021 | 0.000058 | 0.705538 |
| 30 | (6000.0, 6200.0] | 2054 | 0.000974 | 0.009354 | 2.0 | 2052.0 | 0.000034 | 0.012758 | 0.002665 | 0.000488 | 0.001334 | 0.705538 |
| 31 | (6200.0, 6400.0] | 1793 | 0.001673 | 0.008165 | 3.0 | 1790.0 | 0.000051 | 0.011129 | 0.004578 | 0.000699 | 0.001913 | 0.705538 |
| 32 | (6400.0, 6600.0] | 1756 | 0.001139 | 0.007997 | 2.0 | 1754.0 | 0.000034 | 0.010905 | 0.003117 | 0.000534 | 0.001461 | 0.705538 |
| 33 | (6600.0, 6800.0] | 1817 | 0.001101 | 0.008274 | 2.0 | 1815.0 | 0.000034 | 0.011284 | 0.003012 | 0.000038 | 0.000105 | 0.705538 |
| 34 | (6800.0, 7000.0] | 1786 | 0.001680 | 0.008133 | 3.0 | 1783.0 | 0.000051 | 0.011085 | 0.004596 | 0.000579 | 0.001584 | 0.705538 |
| 35 | (7000.0, 7200.0] | 1875 | 0.000000 | 0.008538 | 0.0 | 1875.0 | 0.000000 | 0.011657 | 0.000000 | 0.001680 | 0.004596 | 0.705538 |
| 36 | (7200.0, 7400.0] | 1702 | 0.001175 | 0.007751 | 2.0 | 1700.0 | 0.000034 | 0.010569 | 0.003216 | 0.001175 | 0.003216 | 0.705538 |
| 37 | (7400.0, 7600.0] | 1708 | 0.001756 | 0.007778 | 3.0 | 1705.0 | 0.000051 | 0.010600 | 0.004806 | 0.000581 | 0.001590 | 0.705538 |
| 38 | (7600.0, 7800.0] | 1551 | 0.000000 | 0.007063 | 0.0 | 1551.0 | 0.000000 | 0.009643 | 0.000000 | 0.001756 | 0.004806 | 0.705538 |
| 39 | (7800.0, 8000.0] | 1564 | 0.000639 | 0.007122 | 1.0 | 1563.0 | 0.000017 | 0.009718 | 0.001750 | 0.000639 | 0.001750 | 0.705538 |
| 40 | (8000.0, 8200.0] | 1742 | 0.000574 | 0.007933 | 1.0 | 1741.0 | 0.000017 | 0.010824 | 0.001571 | 0.000065 | 0.000179 | 0.705538 |
| 41 | (8200.0, 8400.0] | 1562 | 0.000640 | 0.007113 | 1.0 | 1561.0 | 0.000017 | 0.009705 | 0.001752 | 0.000066 | 0.000181 | 0.705538 |
| 42 | (8400.0, 8600.0] | 1593 | 0.000628 | 0.007254 | 1.0 | 1592.0 | 0.000017 | 0.009898 | 0.001718 | 0.000012 | 0.000034 | 0.705538 |
| 43 | (8600.0, 8800.0] | 1431 | 0.001398 | 0.006517 | 2.0 | 1429.0 | 0.000034 | 0.008884 | 0.003824 | 0.000770 | 0.002106 | 0.705538 |
| 44 | (8800.0, 9000.0] | 1455 | 0.000687 | 0.006626 | 1.0 | 1454.0 | 0.000017 | 0.009040 | 0.001881 | 0.000710 | 0.001943 | 0.705538 |
| 45 | (9000.0, 9200.0] | 1495 | 0.000669 | 0.006808 | 1.0 | 1494.0 | 0.000017 | 0.009289 | 0.001831 | 0.000018 | 0.000050 | 0.705538 |
| 46 | (9200.0, 9400.0] | 1458 | 0.000000 | 0.006640 | 0.0 | 1458.0 | 0.000000 | 0.009065 | 0.000000 | 0.000669 | 0.001831 | 0.705538 |
| 47 | (9400.0, 9600.0] | 1419 | 0.000705 | 0.006462 | 1.0 | 1418.0 | 0.000017 | 0.008816 | 0.001929 | 0.000705 | 0.001929 | 0.705538 |
| 48 | (9600.0, 9800.0] | 1462 | 0.000000 | 0.006658 | 0.0 | 1462.0 | 0.000000 | 0.009090 | 0.000000 | 0.000705 | 0.001929 | 0.705538 |
| 49 | (9800.0, 10000.0] | 1465 | 0.002048 | 0.006671 | 3.0 | 1462.0 | 0.000051 | 0.009090 | 0.005602 | 0.002048 | 0.005602 | 0.705538 |
#plot_by_woe(df_temp.iloc[7: , : ], 90)
plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
# Categories: '<=200', '200-700', '700-1000', '1000-1500', '1500-2600', '2600-10K', ' >10K'
df_inputs_prepr['last_pymnt_amnt:<=200'] = np.where((df_inputs_prepr['last_pymnt_amnt'] <= 200), 1, 0)
df_inputs_prepr['last_pymnt_amnt:200-700'] = np.where((df_inputs_prepr['last_pymnt_amnt'] > 200) & (df_inputs_prepr['last_pymnt_amnt'] <= 700), 1, 0)
df_inputs_prepr['last_pymnt_amnt:700-1000'] = np.where((df_inputs_prepr['last_pymnt_amnt'] > 700) & (df_inputs_prepr['last_pymnt_amnt'] <= 1000), 1, 0)
df_inputs_prepr['last_pymnt_amnt:1000-1500'] = np.where((df_inputs_prepr['last_pymnt_amnt'] > 1000) & (df_inputs_prepr['last_pymnt_amnt'] <= 1500), 1, 0)
df_inputs_prepr['last_pymnt_amnt:1500-2600'] = np.where((df_inputs_prepr['last_pymnt_amnt'] > 1500) & (df_inputs_prepr['last_pymnt_amnt'] <= 2600), 1, 0)
df_inputs_prepr['last_pymnt_amnt:2600-10000'] = np.where((df_inputs_prepr['last_pymnt_amnt'] > 2600) & (df_inputs_prepr['last_pymnt_amnt'] <= 10000), 1, 0)
df_inputs_prepr['last_pymnt_amnt:>10000'] = np.where((df_inputs_prepr['last_pymnt_amnt'] > 10000), 1, 0)
Variable: 'principal_paid_ratio'¶
df_inputs_prepr['principal_paid_ratio'].nunique()
54695
df_inputs_prepr.loc[df_inputs_prepr['principal_paid_ratio'] >= 1., : ]['principal_paid_ratio'].count()
214732
# one other category will be created for 'principal_paid_ratio' value = 1 with a count of 858698.
#********************************
# 'principal_paid_ratio'
# the categories of everyone with 'principal_paid_ratio' < 1.
df_inputs_prepr_temp = df_inputs_prepr.loc[df_inputs_prepr['principal_paid_ratio'] < 1., : ]
# Here we do fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.
#df_inputs_prepr_temp
df_inputs_prepr_temp['principal_paid_ratio_factor'] = pd.cut(df_inputs_prepr_temp['principal_paid_ratio'], 20)
# Here we do fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.
df_temp = woe_ordered_continuous(df_inputs_prepr_temp, 'principal_paid_ratio_factor', df_targets_prepr[df_inputs_prepr_temp.index])
# We calculate weight of evidence.
df_temp
C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1212135044.py:9: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()` df_inputs_prepr_temp['principal_paid_ratio_factor'] = pd.cut(df_inputs_prepr_temp['principal_paid_ratio'], 20) C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1212135044.py:9: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy df_inputs_prepr_temp['principal_paid_ratio_factor'] = pd.cut(df_inputs_prepr_temp['principal_paid_ratio'], 20) C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1025164655.py:4: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning. df = pd.concat([df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].count(), C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1025164655.py:5: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning. df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].mean()], axis = 1)
| principal_paid_ratio_factor | n_obs | prop_good | prop_n_obs | n_good | n_bad | prop_n_good | prop_n_bad | WoE | diff_prop_good | diff_WoE | IV | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | (-0.001, 0.05] | 4479 | 1.000000 | 0.075275 | 4479.0 | 0.0 | 0.076193 | 0.000000 | inf | NaN | NaN | inf |
| 1 | (0.05, 0.1] | 6358 | 1.000000 | 0.106854 | 6358.0 | 0.0 | 0.108157 | 0.000000 | inf | 0.000000 | NaN | inf |
| 2 | (0.1, 0.15] | 6557 | 1.000000 | 0.110198 | 6557.0 | 0.0 | 0.111542 | 0.000000 | inf | 0.000000 | NaN | inf |
| 3 | (0.15, 0.2] | 5993 | 0.999833 | 0.100719 | 5992.0 | 1.0 | 0.101931 | 0.001395 | 4.305204 | 0.000167 | inf | inf |
| 4 | (0.2, 0.25] | 5382 | 0.999814 | 0.090451 | 5381.0 | 1.0 | 0.091537 | 0.001395 | 4.199185 | 0.000019 | 0.106020 | inf |
| 5 | (0.25, 0.3] | 4927 | 1.000000 | 0.082804 | 4927.0 | 0.0 | 0.083814 | 0.000000 | inf | 0.000186 | inf | inf |
| 6 | (0.3, 0.35] | 4240 | 1.000000 | 0.071258 | 4240.0 | 0.0 | 0.072127 | 0.000000 | inf | 0.000000 | NaN | inf |
| 7 | (0.35, 0.4] | 3571 | 1.000000 | 0.060015 | 3571.0 | 0.0 | 0.060747 | 0.000000 | inf | 0.000000 | NaN | inf |
| 8 | (0.4, 0.45] | 3241 | 0.999383 | 0.054469 | 3239.0 | 2.0 | 0.055099 | 0.002789 | 3.032692 | 0.000617 | inf | inf |
| 9 | (0.45, 0.5] | 2611 | 0.999234 | 0.043881 | 2609.0 | 2.0 | 0.044382 | 0.002789 | 2.827963 | 0.000149 | 0.204729 | inf |
| 10 | (0.5, 0.55] | 2332 | 0.996141 | 0.039192 | 2323.0 | 9.0 | 0.039517 | 0.012552 | 1.422669 | 0.003093 | 1.405293 | inf |
| 11 | (0.55, 0.6] | 1908 | 0.998428 | 0.032066 | 1905.0 | 3.0 | 0.032406 | 0.004184 | 2.168492 | 0.002287 | 0.745823 | inf |
| 12 | (0.6, 0.65] | 1744 | 0.934060 | 0.029310 | 1629.0 | 115.0 | 0.027711 | 0.160391 | 0.159371 | 0.064368 | 2.009121 | inf |
| 13 | (0.65, 0.7] | 1528 | 0.958770 | 0.025680 | 1465.0 | 63.0 | 0.024921 | 0.087866 | 0.249691 | 0.024710 | 0.090320 | inf |
| 14 | (0.7, 0.75] | 1234 | 0.960292 | 0.020739 | 1185.0 | 49.0 | 0.020158 | 0.068340 | 0.258486 | 0.001522 | 0.008795 | inf |
| 15 | (0.75, 0.8] | 889 | 0.970754 | 0.014941 | 863.0 | 26.0 | 0.014681 | 0.036262 | 0.339928 | 0.010462 | 0.081442 | inf |
| 16 | (0.8, 0.85] | 813 | 0.977860 | 0.013663 | 795.0 | 18.0 | 0.013524 | 0.025105 | 0.430938 | 0.007106 | 0.091010 | inf |
| 17 | (0.85, 0.9] | 573 | 0.963351 | 0.009630 | 552.0 | 21.0 | 0.009390 | 0.029289 | 0.278091 | 0.014509 | 0.152847 | inf |
| 18 | (0.9, 0.95] | 486 | 0.979424 | 0.008168 | 476.0 | 10.0 | 0.008097 | 0.013947 | 0.457790 | 0.016073 | 0.179699 | inf |
| 19 | (0.95, 1.0] | 636 | 0.375786 | 0.010689 | 239.0 | 397.0 | 0.004066 | 0.553696 | 0.007316 | 0.603638 | 0.450474 | inf |
plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
# Categories: '<= 0.3', '0.3 - 0.45', '0.45 - 0.6', '0.6 - 1', ' = 1'
df_inputs_prepr['principal_paid_ratio:<=0.3'] = np.where((df_inputs_prepr['principal_paid_ratio'] <= 0.3), 1, 0)
df_inputs_prepr['principal_paid_ratio:0.3-0.45'] = np.where((df_inputs_prepr['principal_paid_ratio'] > 0.3) & (df_inputs_prepr['principal_paid_ratio'] <= 0.45), 1, 0)
df_inputs_prepr['principal_paid_ratio:0.45-0.6'] = np.where((df_inputs_prepr['principal_paid_ratio'] > 0.45) & (df_inputs_prepr['principal_paid_ratio'] <= 0.6), 1, 0)
df_inputs_prepr['principal_paid_ratio:0.6-1'] = np.where((df_inputs_prepr['principal_paid_ratio'] > 0.6) & (df_inputs_prepr['principal_paid_ratio'] <= 1.), 1, 0)
df_inputs_prepr['principal_paid_ratio:=1'] = np.where((df_inputs_prepr['principal_paid_ratio'] >= 1.), 1, 0)
Variable: 'fico_range_high'¶
df_inputs_prepr['fico_range_high'].unique()
array([679., 699., 664., 709., 684., 704., 714., 674., 739., 754., 694.,
749., 779., 669., 784., 689., 719., 724., 729., 819., 794., 769.,
774., 734., 744., 789., 759., 764., 809., 814., 799., 824., 834.,
804., 829., 844., 839., 850.])
# 'fico_range_high'
df_temp = woe_ordered_continuous(df_inputs_prepr, 'fico_range_high', df_targets_prepr)
# We calculate weight of evidence.
df_temp
| fico_range_high | n_obs | prop_good | prop_n_obs | n_good | n_bad | prop_n_good | prop_n_bad | WoE | diff_prop_good | diff_WoE | IV | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 664.0 | 24699 | 0.282238 | 0.090065 | 6971.0 | 17728.0 | 0.118550 | 0.082290 | 0.892258 | NaN | NaN | 0.051934 |
| 1 | 669.0 | 23877 | 0.275495 | 0.087068 | 6578.0 | 17299.0 | 0.111867 | 0.080299 | 0.872601 | 0.006743 | 0.019656 | 0.051934 |
| 2 | 674.0 | 23894 | 0.261781 | 0.087130 | 6255.0 | 17639.0 | 0.106374 | 0.081877 | 0.832555 | 0.013714 | 0.040046 | 0.051934 |
| 3 | 679.0 | 21241 | 0.250741 | 0.077456 | 5326.0 | 15915.0 | 0.090575 | 0.073875 | 0.800234 | 0.011040 | 0.032321 | 0.051934 |
| 4 | 684.0 | 21156 | 0.243288 | 0.077146 | 5147.0 | 16009.0 | 0.087531 | 0.074311 | 0.778361 | 0.007454 | 0.021874 | 0.051934 |
| 5 | 689.0 | 18350 | 0.231281 | 0.066914 | 4244.0 | 14106.0 | 0.072174 | 0.065478 | 0.743020 | 0.012007 | 0.035341 | 0.051934 |
| 6 | 694.0 | 17828 | 0.222796 | 0.065010 | 3972.0 | 13856.0 | 0.067549 | 0.064317 | 0.717958 | 0.008485 | 0.025062 | 0.051934 |
| 7 | 699.0 | 16181 | 0.219640 | 0.059004 | 3554.0 | 12627.0 | 0.060440 | 0.058612 | 0.708618 | 0.003155 | 0.009340 | 0.051934 |
| 8 | 704.0 | 14750 | 0.201627 | 0.053786 | 2974.0 | 11776.0 | 0.050577 | 0.054662 | 0.655058 | 0.018013 | 0.053560 | 0.051934 |
| 9 | 709.0 | 13447 | 0.192459 | 0.049035 | 2588.0 | 10859.0 | 0.044012 | 0.050406 | 0.627625 | 0.009168 | 0.027433 | 0.051934 |
| 10 | 714.0 | 11568 | 0.174187 | 0.042183 | 2015.0 | 9553.0 | 0.034268 | 0.044343 | 0.572546 | 0.018272 | 0.055079 | 0.051934 |
| 11 | 719.0 | 10455 | 0.164610 | 0.038124 | 1721.0 | 8734.0 | 0.029268 | 0.040542 | 0.543437 | 0.009577 | 0.029110 | 0.051934 |
| 12 | 724.0 | 8922 | 0.160278 | 0.032534 | 1430.0 | 7492.0 | 0.024319 | 0.034777 | 0.530210 | 0.004332 | 0.013227 | 0.051934 |
| 13 | 729.0 | 6979 | 0.160768 | 0.025449 | 1122.0 | 5857.0 | 0.019081 | 0.027187 | 0.531708 | 0.000490 | 0.001498 | 0.051934 |
| 14 | 734.0 | 6186 | 0.150824 | 0.022557 | 933.0 | 5253.0 | 0.015867 | 0.024384 | 0.501210 | 0.009944 | 0.030498 | 0.051934 |
| 15 | 739.0 | 4915 | 0.143235 | 0.017923 | 704.0 | 4211.0 | 0.011972 | 0.019547 | 0.477785 | 0.007589 | 0.023425 | 0.051934 |
| 16 | 744.0 | 4417 | 0.132443 | 0.016107 | 585.0 | 3832.0 | 0.009949 | 0.017788 | 0.444240 | 0.010792 | 0.033545 | 0.051934 |
| 17 | 749.0 | 3528 | 0.118197 | 0.012865 | 417.0 | 3111.0 | 0.007092 | 0.014441 | 0.399502 | 0.014246 | 0.044738 | 0.051934 |
| 18 | 754.0 | 3296 | 0.119842 | 0.012019 | 395.0 | 2901.0 | 0.006717 | 0.013466 | 0.404696 | 0.001645 | 0.005194 | 0.051934 |
| 19 | 759.0 | 2772 | 0.128788 | 0.010108 | 357.0 | 2415.0 | 0.006071 | 0.011210 | 0.432813 | 0.008946 | 0.028117 | 0.051934 |
| 20 | 764.0 | 2270 | 0.117181 | 0.008278 | 266.0 | 2004.0 | 0.004524 | 0.009302 | 0.396288 | 0.011607 | 0.036525 | 0.051934 |
| 21 | 769.0 | 2082 | 0.109030 | 0.007592 | 227.0 | 1855.0 | 0.003860 | 0.008611 | 0.370413 | 0.008151 | 0.025875 | 0.051934 |
| 22 | 774.0 | 1914 | 0.115987 | 0.006979 | 222.0 | 1692.0 | 0.003775 | 0.007854 | 0.392512 | 0.006958 | 0.022100 | 0.051934 |
| 23 | 779.0 | 1664 | 0.100962 | 0.006068 | 168.0 | 1496.0 | 0.002857 | 0.006944 | 0.344603 | 0.015026 | 0.047909 | 0.051934 |
| 24 | 784.0 | 1505 | 0.087043 | 0.005488 | 131.0 | 1374.0 | 0.002228 | 0.006378 | 0.299588 | 0.013918 | 0.045015 | 0.051934 |
| 25 | 789.0 | 1246 | 0.079454 | 0.004544 | 99.0 | 1147.0 | 0.001684 | 0.005324 | 0.274764 | 0.007589 | 0.024824 | 0.051934 |
| 26 | 794.0 | 1061 | 0.071631 | 0.003869 | 76.0 | 985.0 | 0.001292 | 0.004572 | 0.248952 | 0.007824 | 0.025812 | 0.051934 |
| 27 | 799.0 | 878 | 0.079727 | 0.003202 | 70.0 | 808.0 | 0.001190 | 0.003751 | 0.275659 | 0.008096 | 0.026707 | 0.051934 |
| 28 | 804.0 | 773 | 0.087969 | 0.002819 | 68.0 | 705.0 | 0.001156 | 0.003272 | 0.302603 | 0.008242 | 0.026944 | 0.051934 |
| 29 | 809.0 | 690 | 0.088406 | 0.002516 | 61.0 | 629.0 | 0.001037 | 0.002920 | 0.304024 | 0.000437 | 0.001421 | 0.051934 |
| 30 | 814.0 | 477 | 0.056604 | 0.001739 | 27.0 | 450.0 | 0.000459 | 0.002089 | 0.198704 | 0.031802 | 0.105320 | 0.051934 |
| 31 | 819.0 | 397 | 0.080605 | 0.001448 | 32.0 | 365.0 | 0.000544 | 0.001694 | 0.278540 | 0.024001 | 0.079836 | 0.051934 |
| 32 | 824.0 | 302 | 0.096026 | 0.001101 | 29.0 | 273.0 | 0.000493 | 0.001267 | 0.328716 | 0.015422 | 0.050175 | 0.051934 |
| 33 | 829.0 | 217 | 0.082949 | 0.000791 | 18.0 | 199.0 | 0.000306 | 0.000924 | 0.286222 | 0.013077 | 0.042493 | 0.051934 |
| 34 | 834.0 | 122 | 0.081967 | 0.000445 | 10.0 | 112.0 | 0.000170 | 0.000520 | 0.283007 | 0.000982 | 0.003215 | 0.051934 |
| 35 | 839.0 | 75 | 0.013333 | 0.000273 | 1.0 | 74.0 | 0.000017 | 0.000343 | 0.048323 | 0.068634 | 0.234685 | 0.051934 |
| 36 | 844.0 | 54 | 0.055556 | 0.000197 | 3.0 | 51.0 | 0.000051 | 0.000237 | 0.195164 | 0.042222 | 0.146842 | 0.051934 |
| 37 | 850.0 | 46 | 0.130435 | 0.000168 | 6.0 | 40.0 | 0.000102 | 0.000186 | 0.437966 | 0.074879 | 0.242802 | 0.051934 |
plot_by_woe(df_temp.iloc[2: , : ], 90)
#plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
# We create the following categories: '< 680', '680-700', '700-720', '720-750', '750-795', '> 795.
# '> 795' will be the reference category
df_inputs_prepr['fico_range_high:<=680'] = np.where((df_inputs_prepr['fico_range_high'] <= 680), 1, 0)
df_inputs_prepr['fico_range_high:680-700'] = np.where((df_inputs_prepr['fico_range_high'] > 680) & (df_inputs_prepr['fico_range_high'] <= 700), 1, 0)
df_inputs_prepr['fico_range_high:700-720'] = np.where((df_inputs_prepr['fico_range_high'] > 700) & (df_inputs_prepr['fico_range_high'] <= 720), 1, 0)
df_inputs_prepr['fico_range_high:720-750'] = np.where((df_inputs_prepr['fico_range_high'] > 720) & (df_inputs_prepr['fico_range_high'] <= 750), 1, 0)
df_inputs_prepr['fico_range_high:750-795'] = np.where((df_inputs_prepr['fico_range_high'] > 750) & (df_inputs_prepr['fico_range_high'] <= 795), 1, 0)
df_inputs_prepr['fico_range_high:>795'] = np.where((df_inputs_prepr['fico_range_high'] > 795), 1, 0)
Variable: 'last_fico_range_high'¶
df_inputs_prepr['last_fico_range_high'].unique()
array([574., 709., 714., 749., 664., 799., 589., 689., 499., 694., 579.,
784., 774., 669., 554., 684., 529., 659., 734., 769., 739., 569.,
834., 634., 594., 674., 614., 724., 789., 719., 729., 704., 544.,
629., 584., 524., 599., 509., 699., 779., 619., 549., 744., 654.,
649., 794., 754., 539., 624., 804., 759., 604., 519., 534., 679.,
764., 644., 819., 609., 839., 559., 829., 639., 504., 564., 809.,
814., 824., 514., 0., 844., 850.])
# 'fico_range_high'
df_temp = woe_ordered_continuous(df_inputs_prepr, 'last_fico_range_high', df_targets_prepr)
# We calculate weight of evidence.
df_temp
| last_fico_range_high | n_obs | prop_good | prop_n_obs | n_good | n_bad | prop_n_good | prop_n_bad | WoE | diff_prop_good | diff_WoE | IV | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 45 | 0.088889 | 0.000164 | 4.0 | 41.0 | 0.000068 | 0.000190 | 0.305595 | NaN | NaN | 1.778145 |
| 1 | 499.0 | 7207 | 0.872485 | 0.026280 | 6288.0 | 919.0 | 0.106935 | 0.004266 | 3.260698 | 0.783596 | 2.955103 | 1.778145 |
| 2 | 504.0 | 1510 | 0.851656 | 0.005506 | 1286.0 | 224.0 | 0.021870 | 0.001040 | 3.092563 | 0.020829 | 0.168135 | 1.778145 |
| 3 | 509.0 | 1679 | 0.854080 | 0.006123 | 1434.0 | 245.0 | 0.024387 | 0.001137 | 3.111013 | 0.002424 | 0.018450 | 1.778145 |
| 4 | 514.0 | 1928 | 0.821577 | 0.007030 | 1584.0 | 344.0 | 0.026938 | 0.001597 | 2.883123 | 0.032503 | 0.227890 | 1.778145 |
| 5 | 519.0 | 1964 | 0.815173 | 0.007162 | 1601.0 | 363.0 | 0.027227 | 0.001685 | 2.842498 | 0.006404 | 0.040625 | 1.778145 |
| 6 | 524.0 | 2277 | 0.819060 | 0.008303 | 1865.0 | 412.0 | 0.031717 | 0.001912 | 2.867012 | 0.003887 | 0.024515 | 1.778145 |
| 7 | 529.0 | 2252 | 0.804618 | 0.008212 | 1812.0 | 440.0 | 0.030815 | 0.002042 | 2.778056 | 0.014442 | 0.088956 | 1.778145 |
| 8 | 534.0 | 2427 | 0.791100 | 0.008850 | 1920.0 | 507.0 | 0.032652 | 0.002353 | 2.699636 | 0.013518 | 0.078421 | 1.778145 |
| 9 | 539.0 | 2544 | 0.768475 | 0.009277 | 1955.0 | 589.0 | 0.033247 | 0.002734 | 2.577216 | 0.022625 | 0.122420 | 1.778145 |
| 10 | 544.0 | 2901 | 0.772148 | 0.010579 | 2240.0 | 661.0 | 0.038094 | 0.003068 | 2.596412 | 0.003673 | 0.019196 | 1.778145 |
| 11 | 549.0 | 2580 | 0.760465 | 0.009408 | 1962.0 | 618.0 | 0.033366 | 0.002869 | 2.536179 | 0.011682 | 0.060233 | 1.778145 |
| 12 | 554.0 | 2850 | 0.759649 | 0.010393 | 2165.0 | 685.0 | 0.036818 | 0.003180 | 2.532059 | 0.000816 | 0.004119 | 1.778145 |
| 13 | 559.0 | 2678 | 0.725915 | 0.009765 | 1944.0 | 734.0 | 0.033060 | 0.003407 | 2.370550 | 0.033734 | 0.161510 | 1.778145 |
| 14 | 564.0 | 2956 | 0.727673 | 0.010779 | 2151.0 | 805.0 | 0.036580 | 0.003737 | 2.378578 | 0.001758 | 0.008028 | 1.778145 |
| 15 | 569.0 | 2614 | 0.730298 | 0.009532 | 1909.0 | 705.0 | 0.032465 | 0.003272 | 2.390645 | 0.002626 | 0.012067 | 1.778145 |
| 16 | 574.0 | 2826 | 0.673036 | 0.010305 | 1902.0 | 924.0 | 0.032346 | 0.004289 | 2.144934 | 0.057262 | 0.245710 | 1.778145 |
| 17 | 579.0 | 2673 | 0.667789 | 0.009747 | 1785.0 | 888.0 | 0.030356 | 0.004122 | 2.123997 | 0.005247 | 0.020938 | 1.778145 |
| 18 | 584.0 | 2770 | 0.632852 | 0.010101 | 1753.0 | 1017.0 | 0.029812 | 0.004721 | 1.989938 | 0.034937 | 0.134058 | 1.778145 |
| 19 | 589.0 | 2553 | 0.624363 | 0.009310 | 1594.0 | 959.0 | 0.027108 | 0.004452 | 1.958627 | 0.008488 | 0.031311 | 1.778145 |
| 20 | 594.0 | 2772 | 0.613997 | 0.010108 | 1702.0 | 1070.0 | 0.028945 | 0.004967 | 1.920981 | 0.010366 | 0.037646 | 1.778145 |
| 21 | 599.0 | 2517 | 0.571712 | 0.009178 | 1439.0 | 1078.0 | 0.024472 | 0.005004 | 1.773354 | 0.042285 | 0.147627 | 1.778145 |
| 22 | 604.0 | 2636 | 0.554628 | 0.009612 | 1462.0 | 1174.0 | 0.024863 | 0.005450 | 1.716037 | 0.017084 | 0.057317 | 1.778145 |
| 23 | 609.0 | 2657 | 0.488521 | 0.009689 | 1298.0 | 1359.0 | 0.022074 | 0.006308 | 1.503908 | 0.066107 | 0.212129 | 1.778145 |
| 24 | 614.0 | 2692 | 0.495542 | 0.009816 | 1334.0 | 1358.0 | 0.022686 | 0.006304 | 1.525825 | 0.007021 | 0.021917 | 1.778145 |
| 25 | 619.0 | 2565 | 0.445614 | 0.009353 | 1143.0 | 1422.0 | 0.019438 | 0.006601 | 1.372414 | 0.049928 | 0.153411 | 1.778145 |
| 26 | 624.0 | 2845 | 0.387698 | 0.010374 | 1103.0 | 1742.0 | 0.018758 | 0.008086 | 1.199896 | 0.057916 | 0.172517 | 1.778145 |
| 27 | 629.0 | 2723 | 0.369078 | 0.009929 | 1005.0 | 1718.0 | 0.017091 | 0.007975 | 1.145239 | 0.018619 | 0.054658 | 1.778145 |
| 28 | 634.0 | 3175 | 0.319055 | 0.011578 | 1013.0 | 2162.0 | 0.017227 | 0.010036 | 0.999385 | 0.050023 | 0.145854 | 1.778145 |
| 29 | 639.0 | 3013 | 0.281779 | 0.010987 | 849.0 | 2164.0 | 0.014438 | 0.010045 | 0.890920 | 0.037276 | 0.108466 | 1.778145 |
| 30 | 644.0 | 3623 | 0.248413 | 0.013211 | 900.0 | 2723.0 | 0.015306 | 0.012640 | 0.793406 | 0.033366 | 0.097514 | 1.778145 |
| 31 | 649.0 | 3780 | 0.199471 | 0.013784 | 754.0 | 3026.0 | 0.012823 | 0.014046 | 0.648617 | 0.048942 | 0.144788 | 1.778145 |
| 32 | 654.0 | 4347 | 0.159650 | 0.015851 | 694.0 | 3653.0 | 0.011802 | 0.016957 | 0.528290 | 0.039821 | 0.120327 | 1.778145 |
| 33 | 659.0 | 4949 | 0.121237 | 0.018047 | 600.0 | 4349.0 | 0.010204 | 0.020187 | 0.409093 | 0.038414 | 0.119197 | 1.778145 |
| 34 | 664.0 | 5470 | 0.108775 | 0.019946 | 595.0 | 4875.0 | 0.010119 | 0.022629 | 0.369601 | 0.012461 | 0.039492 | 1.778145 |
| 35 | 669.0 | 5999 | 0.081014 | 0.021875 | 486.0 | 5513.0 | 0.008265 | 0.025590 | 0.279882 | 0.027762 | 0.089720 | 1.778145 |
| 36 | 674.0 | 6961 | 0.065508 | 0.025383 | 456.0 | 6505.0 | 0.007755 | 0.030195 | 0.228588 | 0.015506 | 0.051294 | 1.778145 |
| 37 | 679.0 | 7069 | 0.049370 | 0.025777 | 349.0 | 6720.0 | 0.005935 | 0.031193 | 0.174182 | 0.016137 | 0.054406 | 1.778145 |
| 38 | 684.0 | 8107 | 0.043296 | 0.029562 | 351.0 | 7756.0 | 0.005969 | 0.036002 | 0.153408 | 0.006075 | 0.020773 | 1.778145 |
| 39 | 689.0 | 8036 | 0.035590 | 0.029303 | 286.0 | 7750.0 | 0.004864 | 0.035974 | 0.126810 | 0.007706 | 0.026598 | 1.778145 |
| 40 | 694.0 | 8714 | 0.031099 | 0.031776 | 271.0 | 8443.0 | 0.004609 | 0.039191 | 0.111179 | 0.004490 | 0.015631 | 1.778145 |
| 41 | 699.0 | 8472 | 0.022073 | 0.030893 | 187.0 | 8285.0 | 0.003180 | 0.038458 | 0.079451 | 0.009027 | 0.031728 | 1.778145 |
| 42 | 704.0 | 8625 | 0.022377 | 0.031451 | 193.0 | 8432.0 | 0.003282 | 0.039140 | 0.080527 | 0.000304 | 0.001076 | 1.778145 |
| 43 | 709.0 | 8537 | 0.019913 | 0.031130 | 170.0 | 8367.0 | 0.002891 | 0.038838 | 0.071798 | 0.002463 | 0.008729 | 1.778145 |
| 44 | 714.0 | 8304 | 0.017100 | 0.030281 | 142.0 | 8162.0 | 0.002415 | 0.037887 | 0.061791 | 0.002813 | 0.010007 | 1.778145 |
| 45 | 719.0 | 8215 | 0.014973 | 0.029956 | 123.0 | 8092.0 | 0.002092 | 0.037562 | 0.054193 | 0.002128 | 0.007597 | 1.778145 |
| 46 | 724.0 | 8010 | 0.012110 | 0.029209 | 97.0 | 7913.0 | 0.001650 | 0.036731 | 0.043931 | 0.002863 | 0.010262 | 1.778145 |
| 47 | 729.0 | 7246 | 0.012007 | 0.026423 | 87.0 | 7159.0 | 0.001480 | 0.033231 | 0.043560 | 0.000103 | 0.000371 | 1.778145 |
| 48 | 734.0 | 7276 | 0.011545 | 0.026532 | 84.0 | 7192.0 | 0.001429 | 0.033384 | 0.041900 | 0.000462 | 0.001660 | 1.778145 |
| 49 | 739.0 | 6271 | 0.009727 | 0.022867 | 61.0 | 6210.0 | 0.001037 | 0.028826 | 0.035355 | 0.001817 | 0.006545 | 1.778145 |
| 50 | 744.0 | 5907 | 0.008295 | 0.021540 | 49.0 | 5858.0 | 0.000833 | 0.027192 | 0.030185 | 0.001432 | 0.005170 | 1.778145 |
| 51 | 749.0 | 5128 | 0.007995 | 0.018699 | 41.0 | 5087.0 | 0.000697 | 0.023613 | 0.029101 | 0.000300 | 0.001084 | 1.778145 |
| 52 | 754.0 | 5041 | 0.005356 | 0.018382 | 27.0 | 5014.0 | 0.000459 | 0.023274 | 0.019537 | 0.002639 | 0.009564 | 1.778145 |
| 53 | 759.0 | 4716 | 0.007846 | 0.017197 | 37.0 | 4679.0 | 0.000629 | 0.021719 | 0.028559 | 0.002490 | 0.009023 | 1.778145 |
| 54 | 764.0 | 4137 | 0.006768 | 0.015086 | 28.0 | 4109.0 | 0.000476 | 0.019073 | 0.024659 | 0.001077 | 0.003901 | 1.778145 |
| 55 | 769.0 | 3954 | 0.007334 | 0.014418 | 29.0 | 3925.0 | 0.000493 | 0.018219 | 0.026709 | 0.000566 | 0.002050 | 1.778145 |
| 56 | 774.0 | 3566 | 0.006169 | 0.013003 | 22.0 | 3544.0 | 0.000374 | 0.016451 | 0.022488 | 0.001165 | 0.004221 | 1.778145 |
| 57 | 779.0 | 3711 | 0.008084 | 0.013532 | 30.0 | 3681.0 | 0.000510 | 0.017087 | 0.029422 | 0.001915 | 0.006934 | 1.778145 |
| 58 | 784.0 | 3385 | 0.006204 | 0.012343 | 21.0 | 3364.0 | 0.000357 | 0.015615 | 0.022613 | 0.001880 | 0.006809 | 1.778145 |
| 59 | 789.0 | 2942 | 0.005438 | 0.010728 | 16.0 | 2926.0 | 0.000272 | 0.013582 | 0.019836 | 0.000765 | 0.002777 | 1.778145 |
| 60 | 794.0 | 2908 | 0.007909 | 0.010604 | 23.0 | 2885.0 | 0.000391 | 0.013392 | 0.028789 | 0.002471 | 0.008954 | 1.778145 |
| 61 | 799.0 | 2423 | 0.007429 | 0.008836 | 18.0 | 2405.0 | 0.000306 | 0.011164 | 0.027051 | 0.000480 | 0.001738 | 1.778145 |
| 62 | 804.0 | 2224 | 0.008094 | 0.008110 | 18.0 | 2206.0 | 0.000306 | 0.010240 | 0.029456 | 0.000665 | 0.002405 | 1.778145 |
| 63 | 809.0 | 2019 | 0.006439 | 0.007362 | 13.0 | 2006.0 | 0.000221 | 0.009312 | 0.023465 | 0.001655 | 0.005991 | 1.778145 |
| 64 | 814.0 | 1553 | 0.007083 | 0.005663 | 11.0 | 1542.0 | 0.000187 | 0.007158 | 0.025800 | 0.000644 | 0.002334 | 1.778145 |
| 65 | 819.0 | 1321 | 0.010598 | 0.004817 | 14.0 | 1307.0 | 0.000238 | 0.006067 | 0.038493 | 0.003515 | 0.012694 | 1.778145 |
| 66 | 824.0 | 922 | 0.003254 | 0.003362 | 3.0 | 919.0 | 0.000051 | 0.004266 | 0.011889 | 0.007344 | 0.026604 | 1.778145 |
| 67 | 829.0 | 729 | 0.008230 | 0.002658 | 6.0 | 723.0 | 0.000102 | 0.003356 | 0.029951 | 0.004977 | 0.018062 | 1.778145 |
| 68 | 834.0 | 423 | 0.009456 | 0.001542 | 4.0 | 419.0 | 0.000068 | 0.001945 | 0.034378 | 0.001226 | 0.004427 | 1.778145 |
| 69 | 839.0 | 216 | 0.009259 | 0.000788 | 2.0 | 214.0 | 0.000034 | 0.000993 | 0.033667 | 0.000197 | 0.000711 | 1.778145 |
| 70 | 844.0 | 115 | 0.000000 | 0.000419 | 0.0 | 115.0 | 0.000000 | 0.000534 | 0.000000 | 0.009259 | 0.033667 | 1.778145 |
| 71 | 850.0 | 54 | 0.037037 | 0.000197 | 2.0 | 52.0 | 0.000034 | 0.000241 | 0.131827 | 0.037037 | 0.131827 | 1.778145 |
plot_by_woe(df_temp.iloc[1: 50, : ], 90)
#plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
# We create the following categories: '< 520', '520-550', '550-580', '580-610', '610-640', '640-670', '> 670.
# '> 670' will be the reference category
df_inputs_prepr['last_fico_range_high:<=520'] = np.where((df_inputs_prepr['last_fico_range_high'] <= 520), 1, 0)
df_inputs_prepr['last_fico_range_high:520-550'] = np.where((df_inputs_prepr['last_fico_range_high'] > 520) & (df_inputs_prepr['last_fico_range_high'] <= 550), 1, 0)
df_inputs_prepr['last_fico_range_high:550-580'] = np.where((df_inputs_prepr['last_fico_range_high'] > 550) & (df_inputs_prepr['last_fico_range_high'] <= 580), 1, 0)
df_inputs_prepr['last_fico_range_high:580-610'] = np.where((df_inputs_prepr['last_fico_range_high'] > 580) & (df_inputs_prepr['last_fico_range_high'] <= 610), 1, 0)
df_inputs_prepr['last_fico_range_high:610-640'] = np.where((df_inputs_prepr['last_fico_range_high'] > 610) & (df_inputs_prepr['last_fico_range_high'] <= 640), 1, 0)
df_inputs_prepr['last_fico_range_high:640-670'] = np.where((df_inputs_prepr['last_fico_range_high'] > 640) & (df_inputs_prepr['last_fico_range_high'] <= 670), 1, 0)
df_inputs_prepr['last_fico_range_high:>670'] = np.where((df_inputs_prepr['last_fico_range_high'] > 670), 1, 0)
Variable: 'mo_sin_rcnt_rev_tl_op'¶
df_inputs_prepr['mo_sin_rcnt_rev_tl_op'].nunique()
224
# 'mo_sin_rcnt_rev_tl_op'
# the categories of everyone with 'mo_sin_rcnt_rev_tl_op' less or equal 140.
df_inputs_prepr_temp = df_inputs_prepr.loc[df_inputs_prepr['mo_sin_rcnt_rev_tl_op'] <= 150., : ]
# Here we do fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.
#df_inputs_prepr_temp
df_inputs_prepr_temp['mo_sin_rcnt_rev_tl_op_factor'] = pd.cut(df_inputs_prepr_temp['mo_sin_rcnt_rev_tl_op'], 50)
# Here we do fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.
df_temp = woe_ordered_continuous(df_inputs_prepr_temp, 'mo_sin_rcnt_rev_tl_op_factor', df_targets_prepr[df_inputs_prepr_temp.index])
# We calculate weight of evidence.
df_temp
C:\Users\pc\AppData\Local\Temp\ipykernel_6728\3692458730.py:7: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()` df_inputs_prepr_temp['mo_sin_rcnt_rev_tl_op_factor'] = pd.cut(df_inputs_prepr_temp['mo_sin_rcnt_rev_tl_op'], 50) C:\Users\pc\AppData\Local\Temp\ipykernel_6728\3692458730.py:7: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy df_inputs_prepr_temp['mo_sin_rcnt_rev_tl_op_factor'] = pd.cut(df_inputs_prepr_temp['mo_sin_rcnt_rev_tl_op'], 50) C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1025164655.py:4: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning. df = pd.concat([df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].count(), C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1025164655.py:5: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning. df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].mean()], axis = 1)
| mo_sin_rcnt_rev_tl_op_factor | n_obs | prop_good | prop_n_obs | n_good | n_bad | prop_n_good | prop_n_bad | WoE | diff_prop_good | diff_WoE | IV | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | (-0.15, 3.0] | 63052 | 0.247066 | 0.230036 | 15578.0 | 47474.0 | 0.265094 | 0.220469 | 0.789553 | NaN | NaN | 0.011909 |
| 1 | (3.0, 6.0] | 48792 | 0.228234 | 0.178011 | 11136.0 | 37656.0 | 0.189504 | 0.174874 | 0.734125 | 0.018832 | 0.055428 | 0.011909 |
| 2 | (6.0, 9.0] | 49393 | 0.202822 | 0.180203 | 10018.0 | 39375.0 | 0.170479 | 0.182857 | 0.658713 | 0.025412 | 0.075412 | 0.011909 |
| 3 | (9.0, 12.0] | 26416 | 0.217633 | 0.096375 | 5749.0 | 20667.0 | 0.097832 | 0.095977 | 0.702763 | 0.014811 | 0.044049 | 0.011909 |
| 4 | (12.0, 15.0] | 19549 | 0.214282 | 0.071322 | 4189.0 | 15360.0 | 0.071285 | 0.071332 | 0.692821 | 0.003351 | 0.009942 | 0.011909 |
| 5 | (15.0, 18.0] | 13643 | 0.198197 | 0.049775 | 2704.0 | 10939.0 | 0.046015 | 0.050801 | 0.644895 | 0.016085 | 0.047925 | 0.011909 |
| 6 | (18.0, 21.0] | 10149 | 0.200414 | 0.037027 | 2034.0 | 8115.0 | 0.034613 | 0.037686 | 0.651522 | 0.002217 | 0.006627 | 0.011909 |
| 7 | (21.0, 24.0] | 8080 | 0.185767 | 0.029479 | 1501.0 | 6579.0 | 0.025543 | 0.030553 | 0.607602 | 0.014647 | 0.043921 | 0.011909 |
| 8 | (24.0, 27.0] | 6122 | 0.183110 | 0.022335 | 1121.0 | 5001.0 | 0.019076 | 0.023225 | 0.599596 | 0.002657 | 0.008005 | 0.011909 |
| 9 | (27.0, 30.0] | 4560 | 0.174561 | 0.016637 | 796.0 | 3764.0 | 0.013546 | 0.017480 | 0.573759 | 0.008549 | 0.025837 | 0.011909 |
| 10 | (30.0, 33.0] | 3609 | 0.178720 | 0.013167 | 645.0 | 2964.0 | 0.010976 | 0.013765 | 0.586344 | 0.004158 | 0.012585 | 0.011909 |
| 11 | (33.0, 36.0] | 2999 | 0.166722 | 0.010941 | 500.0 | 2499.0 | 0.008509 | 0.011605 | 0.549948 | 0.011998 | 0.036395 | 0.011909 |
| 12 | (36.0, 39.0] | 2478 | 0.173123 | 0.009041 | 429.0 | 2049.0 | 0.007300 | 0.009516 | 0.569400 | 0.006401 | 0.019452 | 0.011909 |
| 13 | (39.0, 42.0] | 1960 | 0.167347 | 0.007151 | 328.0 | 1632.0 | 0.005582 | 0.007579 | 0.551850 | 0.005777 | 0.017550 | 0.011909 |
| 14 | (42.0, 45.0] | 1642 | 0.163216 | 0.005991 | 268.0 | 1374.0 | 0.004561 | 0.006381 | 0.539259 | 0.004131 | 0.012591 | 0.011909 |
| 15 | (45.0, 48.0] | 1374 | 0.155750 | 0.005013 | 214.0 | 1160.0 | 0.003642 | 0.005387 | 0.516416 | 0.007466 | 0.022843 | 0.011909 |
| 16 | (48.0, 51.0] | 1191 | 0.161209 | 0.004345 | 192.0 | 999.0 | 0.003267 | 0.004639 | 0.533131 | 0.005459 | 0.016715 | 0.011909 |
| 17 | (51.0, 54.0] | 1013 | 0.151037 | 0.003696 | 153.0 | 860.0 | 0.002604 | 0.003994 | 0.501935 | 0.010173 | 0.031196 | 0.011909 |
| 18 | (54.0, 57.0] | 862 | 0.160093 | 0.003145 | 138.0 | 724.0 | 0.002348 | 0.003362 | 0.529718 | 0.009056 | 0.027784 | 0.011909 |
| 19 | (57.0, 60.0] | 766 | 0.159269 | 0.002795 | 122.0 | 644.0 | 0.002076 | 0.002991 | 0.527198 | 0.000824 | 0.002520 | 0.011909 |
| 20 | (60.0, 63.0] | 696 | 0.122126 | 0.002539 | 85.0 | 611.0 | 0.001446 | 0.002837 | 0.411958 | 0.037142 | 0.115240 | 0.011909 |
| 21 | (63.0, 66.0] | 619 | 0.150242 | 0.002258 | 93.0 | 526.0 | 0.001583 | 0.002443 | 0.499489 | 0.028116 | 0.087532 | 0.011909 |
| 22 | (66.0, 69.0] | 547 | 0.166362 | 0.001996 | 91.0 | 456.0 | 0.001549 | 0.002118 | 0.548851 | 0.016120 | 0.049362 | 0.011909 |
| 23 | (69.0, 72.0] | 548 | 0.135036 | 0.001999 | 74.0 | 474.0 | 0.001259 | 0.002201 | 0.452394 | 0.031325 | 0.096457 | 0.011909 |
| 24 | (72.0, 75.0] | 486 | 0.144033 | 0.001773 | 70.0 | 416.0 | 0.001191 | 0.001932 | 0.480324 | 0.008996 | 0.027929 | 0.011909 |
| 25 | (75.0, 78.0] | 443 | 0.155756 | 0.001616 | 69.0 | 374.0 | 0.001174 | 0.001737 | 0.516436 | 0.011723 | 0.036112 | 0.011909 |
| 26 | (78.0, 81.0] | 350 | 0.134286 | 0.001277 | 47.0 | 303.0 | 0.000800 | 0.001407 | 0.450055 | 0.021470 | 0.066381 | 0.011909 |
| 27 | (81.0, 84.0] | 334 | 0.113772 | 0.001219 | 38.0 | 296.0 | 0.000647 | 0.001375 | 0.385551 | 0.020513 | 0.064504 | 0.011909 |
| 28 | (84.0, 87.0] | 333 | 0.156156 | 0.001215 | 52.0 | 281.0 | 0.000885 | 0.001305 | 0.517663 | 0.042384 | 0.132112 | 0.011909 |
| 29 | (87.0, 90.0] | 294 | 0.142857 | 0.001073 | 42.0 | 252.0 | 0.000715 | 0.001170 | 0.476685 | 0.013299 | 0.040978 | 0.011909 |
| 30 | (90.0, 93.0] | 249 | 0.148594 | 0.000908 | 37.0 | 212.0 | 0.000630 | 0.000985 | 0.494412 | 0.005737 | 0.017727 | 0.011909 |
| 31 | (93.0, 96.0] | 221 | 0.167421 | 0.000806 | 37.0 | 184.0 | 0.000630 | 0.000854 | 0.552075 | 0.018826 | 0.057664 | 0.011909 |
| 32 | (96.0, 99.0] | 175 | 0.142857 | 0.000638 | 25.0 | 150.0 | 0.000425 | 0.000697 | 0.476685 | 0.024564 | 0.075390 | 0.011909 |
| 33 | (99.0, 102.0] | 171 | 0.116959 | 0.000624 | 20.0 | 151.0 | 0.000340 | 0.000701 | 0.395647 | 0.025898 | 0.081038 | 0.011909 |
| 34 | (102.0, 105.0] | 154 | 0.168831 | 0.000562 | 26.0 | 128.0 | 0.000442 | 0.000594 | 0.556366 | 0.051872 | 0.160719 | 0.011909 |
| 35 | (105.0, 108.0] | 146 | 0.164384 | 0.000533 | 24.0 | 122.0 | 0.000408 | 0.000567 | 0.542822 | 0.004448 | 0.013544 | 0.011909 |
| 36 | (108.0, 111.0] | 109 | 0.165138 | 0.000398 | 18.0 | 91.0 | 0.000306 | 0.000423 | 0.545121 | 0.000754 | 0.002299 | 0.011909 |
| 37 | (111.0, 114.0] | 91 | 0.120879 | 0.000332 | 11.0 | 80.0 | 0.000187 | 0.000372 | 0.408027 | 0.044258 | 0.137093 | 0.011909 |
| 38 | (114.0, 117.0] | 81 | 0.209877 | 0.000296 | 17.0 | 64.0 | 0.000289 | 0.000297 | 0.679729 | 0.088997 | 0.271702 | 0.011909 |
| 39 | (117.0, 120.0] | 67 | 0.149254 | 0.000244 | 10.0 | 57.0 | 0.000170 | 0.000265 | 0.496444 | 0.060623 | 0.183285 | 0.011909 |
| 40 | (120.0, 123.0] | 55 | 0.200000 | 0.000201 | 11.0 | 44.0 | 0.000187 | 0.000204 | 0.650286 | 0.050746 | 0.153842 | 0.011909 |
| 41 | (123.0, 126.0] | 50 | 0.200000 | 0.000182 | 10.0 | 40.0 | 0.000170 | 0.000186 | 0.650286 | 0.000000 | 0.000000 | 0.011909 |
| 42 | (126.0, 129.0] | 50 | 0.180000 | 0.000182 | 9.0 | 41.0 | 0.000153 | 0.000190 | 0.590212 | 0.020000 | 0.060074 | 0.011909 |
| 43 | (129.0, 132.0] | 33 | 0.151515 | 0.000120 | 5.0 | 28.0 | 0.000085 | 0.000130 | 0.503407 | 0.028485 | 0.086804 | 0.011909 |
| 44 | (132.0, 135.0] | 32 | 0.093750 | 0.000117 | 3.0 | 29.0 | 0.000051 | 0.000135 | 0.321410 | 0.057765 | 0.181997 | 0.011909 |
| 45 | (135.0, 138.0] | 31 | 0.290323 | 0.000113 | 9.0 | 22.0 | 0.000153 | 0.000102 | 0.915912 | 0.196573 | 0.594502 | 0.011909 |
| 46 | (138.0, 141.0] | 33 | 0.090909 | 0.000120 | 3.0 | 30.0 | 0.000051 | 0.000139 | 0.312205 | 0.199413 | 0.603707 | 0.011909 |
| 47 | (141.0, 144.0] | 16 | 0.437500 | 0.000058 | 7.0 | 9.0 | 0.000119 | 0.000042 | 1.348087 | 0.346591 | 1.035881 | 0.011909 |
| 48 | (144.0, 147.0] | 19 | 0.263158 | 0.000069 | 5.0 | 14.0 | 0.000085 | 0.000065 | 0.836683 | 0.174342 | 0.511403 | 0.011909 |
| 49 | (147.0, 150.0] | 13 | 0.076923 | 0.000047 | 1.0 | 12.0 | 0.000017 | 0.000056 | 0.266481 | 0.186235 | 0.570202 | 0.011909 |
plot_by_woe(df_temp.iloc[13 : 47, : ], 90)
#plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
# We create the following categories: '0-3', '3-6', '6-9', '9-20', '20-37', '37-63', '63-80', '80-140', '> 140'.
# '> 140' will be the reference category
df_inputs_prepr['mo_sin_rcnt_rev_tl_op:0-3'] = np.where((df_inputs_prepr['mo_sin_rcnt_rev_tl_op'] <= 3), 1, 0)
df_inputs_prepr['mo_sin_rcnt_rev_tl_op:3-6'] = np.where((df_inputs_prepr['mo_sin_rcnt_rev_tl_op'] > 3) & (df_inputs_prepr['mo_sin_rcnt_rev_tl_op'] <= 6), 1, 0)
df_inputs_prepr['mo_sin_rcnt_rev_tl_op:6-9'] = np.where((df_inputs_prepr['mo_sin_rcnt_rev_tl_op'] > 6) & (df_inputs_prepr['mo_sin_rcnt_rev_tl_op'] <= 9), 1, 0)
df_inputs_prepr['mo_sin_rcnt_rev_tl_op:9-20'] = np.where((df_inputs_prepr['mo_sin_rcnt_rev_tl_op'] > 9) & (df_inputs_prepr['mo_sin_rcnt_rev_tl_op'] <= 20), 1, 0)
df_inputs_prepr['mo_sin_rcnt_rev_tl_op:20-37'] = np.where((df_inputs_prepr['mo_sin_rcnt_rev_tl_op'] > 20) & (df_inputs_prepr['mo_sin_rcnt_rev_tl_op'] <= 37), 1, 0)
df_inputs_prepr['mo_sin_rcnt_rev_tl_op:37-63'] = np.where((df_inputs_prepr['mo_sin_rcnt_rev_tl_op'] > 37) & (df_inputs_prepr['mo_sin_rcnt_rev_tl_op'] <= 63), 1, 0)
df_inputs_prepr['mo_sin_rcnt_rev_tl_op:63-80'] = np.where((df_inputs_prepr['mo_sin_rcnt_rev_tl_op'] > 63) & (df_inputs_prepr['mo_sin_rcnt_rev_tl_op'] <= 80), 1, 0)
df_inputs_prepr['mo_sin_rcnt_rev_tl_op:80-140'] = np.where((df_inputs_prepr['mo_sin_rcnt_rev_tl_op'] > 80) & (df_inputs_prepr['mo_sin_rcnt_rev_tl_op'] <= 140), 1, 0)
df_inputs_prepr['mo_sin_rcnt_rev_tl_op:>140'] = np.where((df_inputs_prepr['mo_sin_rcnt_rev_tl_op'] > 140), 1, 0)
Variable: 'mo_sin_rcnt_tl'¶
df_inputs_prepr['mo_sin_rcnt_tl'].nunique()
153
# 'mo_sin_rcnt_tl'
# the categories of everyone with 'mo_sin_rcnt_tl' less or equal 50.
df_inputs_prepr_temp = df_inputs_prepr.loc[df_inputs_prepr['mo_sin_rcnt_tl'] <= 50, : ]
# Here we do fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.
#df_inputs_prepr_temp
df_inputs_prepr_temp['mo_sin_rcnt_tl_factor'] = pd.cut(df_inputs_prepr_temp['mo_sin_rcnt_tl'], 50)
# Here we do fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.
df_temp = woe_ordered_continuous(df_inputs_prepr_temp, 'mo_sin_rcnt_tl_factor', df_targets_prepr[df_inputs_prepr_temp.index])
# We calculate weight of evidence.
df_temp
C:\Users\pc\AppData\Local\Temp\ipykernel_6728\2033739020.py:7: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()` df_inputs_prepr_temp['mo_sin_rcnt_tl_factor'] = pd.cut(df_inputs_prepr_temp['mo_sin_rcnt_tl'], 50) C:\Users\pc\AppData\Local\Temp\ipykernel_6728\2033739020.py:7: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy df_inputs_prepr_temp['mo_sin_rcnt_tl_factor'] = pd.cut(df_inputs_prepr_temp['mo_sin_rcnt_tl'], 50) C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1025164655.py:4: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning. df = pd.concat([df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].count(), C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1025164655.py:5: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning. df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].mean()], axis = 1)
| mo_sin_rcnt_tl_factor | n_obs | prop_good | prop_n_obs | n_good | n_bad | prop_n_good | prop_n_bad | WoE | diff_prop_good | diff_WoE | IV | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | (-0.05, 1.0] | 27045 | 0.263154 | 0.099212 | 7117.0 | 19928.0 | 0.121571 | 0.093097 | 0.835452 | NaN | NaN | 0.014119 |
| 1 | (1.0, 2.0] | 29629 | 0.240845 | 0.108691 | 7136.0 | 22493.0 | 0.121895 | 0.105079 | 0.770122 | 0.022309 | 0.065330 | 0.014119 |
| 2 | (2.0, 3.0] | 28294 | 0.234997 | 0.103793 | 6649.0 | 21645.0 | 0.113577 | 0.101118 | 0.752929 | 0.005848 | 0.017194 | 0.014119 |
| 3 | (3.0, 4.0] | 24875 | 0.232523 | 0.091251 | 5784.0 | 19091.0 | 0.098801 | 0.089187 | 0.745645 | 0.002474 | 0.007284 | 0.014119 |
| 4 | (4.0, 5.0] | 21449 | 0.218985 | 0.078683 | 4697.0 | 16752.0 | 0.080233 | 0.078260 | 0.705677 | 0.013538 | 0.039968 | 0.014119 |
| 5 | (5.0, 6.0] | 32034 | 0.192296 | 0.117513 | 6160.0 | 25874.0 | 0.105224 | 0.120874 | 0.626217 | 0.026689 | 0.079460 | 0.014119 |
| 6 | (6.0, 7.0] | 16581 | 0.216875 | 0.060826 | 3596.0 | 12985.0 | 0.061426 | 0.060661 | 0.699429 | 0.024579 | 0.073213 | 0.014119 |
| 7 | (7.0, 8.0] | 13901 | 0.206028 | 0.050994 | 2864.0 | 11037.0 | 0.048922 | 0.051561 | 0.667224 | 0.010846 | 0.032205 | 0.014119 |
| 8 | (8.0, 9.0] | 11458 | 0.200820 | 0.042032 | 2301.0 | 9157.0 | 0.039305 | 0.042778 | 0.651705 | 0.005208 | 0.015519 | 0.014119 |
| 9 | (9.0, 10.0] | 9786 | 0.199980 | 0.035899 | 1957.0 | 7829.0 | 0.033429 | 0.036574 | 0.649196 | 0.000841 | 0.002509 | 0.014119 |
| 10 | (10.0, 11.0] | 8448 | 0.194010 | 0.030991 | 1639.0 | 6809.0 | 0.027997 | 0.031809 | 0.631352 | 0.005969 | 0.017843 | 0.014119 |
| 11 | (11.0, 12.0] | 7155 | 0.195667 | 0.026247 | 1400.0 | 5755.0 | 0.023914 | 0.026885 | 0.636311 | 0.001657 | 0.004958 | 0.014119 |
| 12 | (12.0, 13.0] | 6205 | 0.191942 | 0.022762 | 1191.0 | 5014.0 | 0.020344 | 0.023424 | 0.625157 | 0.003725 | 0.011154 | 0.014119 |
| 13 | (13.0, 14.0] | 5106 | 0.189581 | 0.018731 | 968.0 | 4138.0 | 0.016535 | 0.019331 | 0.618076 | 0.002361 | 0.007080 | 0.014119 |
| 14 | (14.0, 15.0] | 3988 | 0.181294 | 0.014630 | 723.0 | 3265.0 | 0.012350 | 0.015253 | 0.593154 | 0.008287 | 0.024923 | 0.014119 |
| 15 | (15.0, 16.0] | 3402 | 0.165197 | 0.012480 | 562.0 | 2840.0 | 0.009600 | 0.013267 | 0.544397 | 0.016097 | 0.048757 | 0.014119 |
| 16 | (16.0, 17.0] | 2879 | 0.176103 | 0.010561 | 507.0 | 2372.0 | 0.008660 | 0.011081 | 0.577482 | 0.010906 | 0.033085 | 0.014119 |
| 17 | (17.0, 18.0] | 2458 | 0.155004 | 0.009017 | 381.0 | 2077.0 | 0.006508 | 0.009703 | 0.513263 | 0.021099 | 0.064219 | 0.014119 |
| 18 | (18.0, 19.0] | 2193 | 0.169175 | 0.008045 | 371.0 | 1822.0 | 0.006337 | 0.008512 | 0.556490 | 0.014171 | 0.043227 | 0.014119 |
| 19 | (19.0, 20.0] | 1788 | 0.154922 | 0.006559 | 277.0 | 1511.0 | 0.004732 | 0.007059 | 0.513011 | 0.014253 | 0.043480 | 0.014119 |
| 20 | (20.0, 21.0] | 1601 | 0.185509 | 0.005873 | 297.0 | 1304.0 | 0.005073 | 0.006092 | 0.605845 | 0.030587 | 0.092834 | 0.014119 |
| 21 | (21.0, 22.0] | 1437 | 0.176061 | 0.005271 | 253.0 | 1184.0 | 0.004322 | 0.005531 | 0.577356 | 0.009448 | 0.028488 | 0.014119 |
| 22 | (22.0, 23.0] | 1237 | 0.178658 | 0.004538 | 221.0 | 1016.0 | 0.003775 | 0.004746 | 0.585202 | 0.002597 | 0.007846 | 0.014119 |
| 23 | (23.0, 24.0] | 1164 | 0.152062 | 0.004270 | 177.0 | 987.0 | 0.003023 | 0.004611 | 0.504236 | 0.026596 | 0.080967 | 0.014119 |
| 24 | (24.0, 25.0] | 907 | 0.158765 | 0.003327 | 144.0 | 763.0 | 0.002460 | 0.003564 | 0.524776 | 0.006703 | 0.020541 | 0.014119 |
| 25 | (25.0, 26.0] | 840 | 0.167857 | 0.003081 | 141.0 | 699.0 | 0.002409 | 0.003265 | 0.552488 | 0.009092 | 0.027712 | 0.014119 |
| 26 | (26.0, 27.0] | 763 | 0.140236 | 0.002799 | 107.0 | 656.0 | 0.001828 | 0.003065 | 0.467755 | 0.027621 | 0.084733 | 0.014119 |
| 27 | (27.0, 28.0] | 658 | 0.165653 | 0.002414 | 109.0 | 549.0 | 0.001862 | 0.002565 | 0.545787 | 0.025418 | 0.078032 | 0.014119 |
| 28 | (28.0, 29.0] | 521 | 0.145873 | 0.001911 | 76.0 | 445.0 | 0.001298 | 0.002079 | 0.485185 | 0.019780 | 0.060602 | 0.014119 |
| 29 | (29.0, 30.0] | 475 | 0.162105 | 0.001742 | 77.0 | 398.0 | 0.001315 | 0.001859 | 0.534976 | 0.016232 | 0.049791 | 0.014119 |
| 30 | (30.0, 31.0] | 477 | 0.161426 | 0.001750 | 77.0 | 400.0 | 0.001315 | 0.001869 | 0.532902 | 0.000680 | 0.002074 | 0.014119 |
| 31 | (31.0, 32.0] | 440 | 0.150000 | 0.001614 | 66.0 | 374.0 | 0.001127 | 0.001747 | 0.497898 | 0.011426 | 0.035004 | 0.014119 |
| 32 | (32.0, 33.0] | 388 | 0.164948 | 0.001423 | 64.0 | 324.0 | 0.001093 | 0.001514 | 0.543641 | 0.014948 | 0.045743 | 0.014119 |
| 33 | (33.0, 34.0] | 364 | 0.192308 | 0.001335 | 70.0 | 294.0 | 0.001196 | 0.001373 | 0.626253 | 0.027359 | 0.082612 | 0.014119 |
| 34 | (34.0, 35.0] | 294 | 0.108844 | 0.001079 | 32.0 | 262.0 | 0.000547 | 0.001224 | 0.369210 | 0.083464 | 0.257043 | 0.014119 |
| 35 | (35.0, 36.0] | 275 | 0.156364 | 0.001009 | 43.0 | 232.0 | 0.000735 | 0.001084 | 0.517428 | 0.047520 | 0.148218 | 0.014119 |
| 36 | (36.0, 37.0] | 258 | 0.124031 | 0.000946 | 32.0 | 226.0 | 0.000547 | 0.001056 | 0.417216 | 0.032333 | 0.100212 | 0.014119 |
| 37 | (37.0, 38.0] | 219 | 0.136986 | 0.000803 | 30.0 | 189.0 | 0.000512 | 0.000883 | 0.457673 | 0.012955 | 0.040457 | 0.014119 |
| 38 | (38.0, 39.0] | 187 | 0.160428 | 0.000686 | 30.0 | 157.0 | 0.000512 | 0.000733 | 0.529856 | 0.023442 | 0.072184 | 0.014119 |
| 39 | (39.0, 40.0] | 182 | 0.159341 | 0.000668 | 29.0 | 153.0 | 0.000495 | 0.000715 | 0.526535 | 0.001087 | 0.003321 | 0.014119 |
| 40 | (40.0, 41.0] | 159 | 0.100629 | 0.000583 | 16.0 | 143.0 | 0.000273 | 0.000668 | 0.342962 | 0.058712 | 0.183573 | 0.014119 |
| 41 | (41.0, 42.0] | 162 | 0.172840 | 0.000594 | 28.0 | 134.0 | 0.000478 | 0.000626 | 0.567606 | 0.072211 | 0.224644 | 0.014119 |
| 42 | (42.0, 43.0] | 152 | 0.125000 | 0.000558 | 19.0 | 133.0 | 0.000325 | 0.000621 | 0.420257 | 0.047840 | 0.147349 | 0.014119 |
| 43 | (43.0, 44.0] | 128 | 0.140625 | 0.000470 | 18.0 | 110.0 | 0.000307 | 0.000514 | 0.468960 | 0.015625 | 0.048703 | 0.014119 |
| 44 | (44.0, 45.0] | 116 | 0.112069 | 0.000426 | 13.0 | 103.0 | 0.000222 | 0.000481 | 0.379461 | 0.028556 | 0.089500 | 0.014119 |
| 45 | (45.0, 46.0] | 121 | 0.206612 | 0.000444 | 25.0 | 96.0 | 0.000427 | 0.000448 | 0.668960 | 0.094543 | 0.289499 | 0.014119 |
| 46 | (46.0, 47.0] | 113 | 0.168142 | 0.000415 | 19.0 | 94.0 | 0.000325 | 0.000439 | 0.553352 | 0.038470 | 0.115607 | 0.014119 |
| 47 | (47.0, 48.0] | 88 | 0.136364 | 0.000323 | 12.0 | 76.0 | 0.000205 | 0.000355 | 0.455738 | 0.031778 | 0.097614 | 0.014119 |
| 48 | (48.0, 49.0] | 93 | 0.182796 | 0.000341 | 17.0 | 76.0 | 0.000290 | 0.000355 | 0.597679 | 0.046432 | 0.141941 | 0.014119 |
| 49 | (49.0, 50.0] | 106 | 0.188679 | 0.000389 | 20.0 | 86.0 | 0.000342 | 0.000402 | 0.615370 | 0.005884 | 0.017691 | 0.014119 |
plot_by_woe(df_temp.iloc[ 8: , : ], 90)
#plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
# We create the following categories: '0-2', '2-5', '5-6', '6-10', '10-15', '15-20', '20-50', '> 50'.
# '> 50' will be the reference category
df_inputs_prepr['mo_sin_rcnt_tl:0-2'] = np.where((df_inputs_prepr['mo_sin_rcnt_tl'] <= 2), 1, 0)
df_inputs_prepr['mo_sin_rcnt_tl:2-5'] = np.where((df_inputs_prepr['mo_sin_rcnt_tl'] > 2) & (df_inputs_prepr['mo_sin_rcnt_tl'] <= 5), 1, 0)
df_inputs_prepr['mo_sin_rcnt_tl:5-6'] = np.where((df_inputs_prepr['mo_sin_rcnt_tl'] > 5) & (df_inputs_prepr['mo_sin_rcnt_tl'] <= 6), 1, 0)
df_inputs_prepr['mo_sin_rcnt_tl:6-10'] = np.where((df_inputs_prepr['mo_sin_rcnt_tl'] > 6) & (df_inputs_prepr['mo_sin_rcnt_tl'] <= 10), 1, 0)
df_inputs_prepr['mo_sin_rcnt_tl:10-15'] = np.where((df_inputs_prepr['mo_sin_rcnt_tl'] > 10) & (df_inputs_prepr['mo_sin_rcnt_tl'] <= 15), 1, 0)
df_inputs_prepr['mo_sin_rcnt_tl:15-20'] = np.where((df_inputs_prepr['mo_sin_rcnt_tl'] > 15) & (df_inputs_prepr['mo_sin_rcnt_tl'] <= 20), 1, 0)
df_inputs_prepr['mo_sin_rcnt_tl:20-50'] = np.where((df_inputs_prepr['mo_sin_rcnt_tl'] > 20) & (df_inputs_prepr['mo_sin_rcnt_tl'] <= 50), 1, 0)
df_inputs_prepr['mo_sin_rcnt_tl:>50'] = np.where((df_inputs_prepr['mo_sin_rcnt_tl'] > 50), 1, 0)
Variable: 'mths_since_rcnt_il'¶
df_inputs_prepr['mths_since_rcnt_il'].nunique()
271
df_inputs_prepr.loc[df_inputs_prepr['mths_since_rcnt_il'] >= 999., : ]['mths_since_rcnt_il'].count()
164353
There are 660399 missing value filled with '999'
# 'mths_since_rcnt_il'
# the categories of everyone with 'mths_since_rcnt_il' less or equal 100.
df_inputs_prepr_temp = df_inputs_prepr.loc[df_inputs_prepr['mths_since_rcnt_il'] <= 100, : ]
# Here we do fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.
# df_inputs_prepr_temp
df_inputs_prepr_temp['mths_since_rcnt_il_factor'] = pd.cut(df_inputs_prepr_temp['mths_since_rcnt_il'], 50)
# Here we do fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.
df_temp = woe_ordered_continuous(df_inputs_prepr_temp, 'mths_since_rcnt_il_factor', df_targets_prepr[df_inputs_prepr_temp.index])
# We calculate weight of evidence.
df_temp
C:\Users\pc\AppData\Local\Temp\ipykernel_6728\3212778370.py:7: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()` df_inputs_prepr_temp['mths_since_rcnt_il_factor'] = pd.cut(df_inputs_prepr_temp['mths_since_rcnt_il'], 50) C:\Users\pc\AppData\Local\Temp\ipykernel_6728\3212778370.py:7: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy df_inputs_prepr_temp['mths_since_rcnt_il_factor'] = pd.cut(df_inputs_prepr_temp['mths_since_rcnt_il'], 50) C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1025164655.py:4: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning. df = pd.concat([df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].count(), C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1025164655.py:5: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning. df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].mean()], axis = 1)
| mths_since_rcnt_il_factor | n_obs | prop_good | prop_n_obs | n_good | n_bad | prop_n_good | prop_n_bad | WoE | diff_prop_good | diff_WoE | IV | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | (-0.1, 2.0] | 7109 | 0.293572 | 0.066417 | 2087.0 | 5022.0 | 0.076669 | 0.062921 | 0.796832 | NaN | NaN | 0.005064 |
| 1 | (2.0, 4.0] | 11130 | 0.277987 | 0.103984 | 3094.0 | 8036.0 | 0.113662 | 0.100683 | 0.755612 | 0.015584 | 0.041220 | 0.005064 |
| 2 | (4.0, 6.0] | 10576 | 0.265696 | 0.098808 | 2810.0 | 7766.0 | 0.103229 | 0.097300 | 0.723160 | 0.012292 | 0.032451 | 0.005064 |
| 3 | (6.0, 8.0] | 10703 | 0.268429 | 0.099994 | 2873.0 | 7830.0 | 0.105544 | 0.098102 | 0.730374 | 0.002733 | 0.007213 | 0.005064 |
| 4 | (8.0, 10.0] | 9182 | 0.258332 | 0.085784 | 2372.0 | 6810.0 | 0.087139 | 0.085322 | 0.703735 | 0.010098 | 0.026639 | 0.005064 |
| 5 | (10.0, 12.0] | 8133 | 0.249477 | 0.075984 | 2029.0 | 6104.0 | 0.074538 | 0.076477 | 0.680390 | 0.008854 | 0.023344 | 0.005064 |
| 6 | (12.0, 14.0] | 7957 | 0.244565 | 0.074339 | 1946.0 | 6011.0 | 0.071489 | 0.075312 | 0.667440 | 0.004913 | 0.012950 | 0.005064 |
| 7 | (14.0, 16.0] | 6037 | 0.235547 | 0.056402 | 1422.0 | 4615.0 | 0.052239 | 0.057821 | 0.643673 | 0.009017 | 0.023768 | 0.005064 |
| 8 | (16.0, 18.0] | 5003 | 0.244453 | 0.046741 | 1223.0 | 3780.0 | 0.044929 | 0.047360 | 0.667147 | 0.008906 | 0.023474 | 0.005064 |
| 9 | (18.0, 20.0] | 4167 | 0.236621 | 0.038931 | 986.0 | 3181.0 | 0.036222 | 0.039855 | 0.646503 | 0.007832 | 0.020644 | 0.005064 |
| 10 | (20.0, 22.0] | 3632 | 0.238987 | 0.033933 | 868.0 | 2764.0 | 0.031887 | 0.034630 | 0.652738 | 0.002366 | 0.006236 | 0.005064 |
| 11 | (22.0, 24.0] | 2937 | 0.236636 | 0.027439 | 695.0 | 2242.0 | 0.025532 | 0.028090 | 0.646542 | 0.002351 | 0.006196 | 0.005064 |
| 12 | (24.0, 26.0] | 2438 | 0.249795 | 0.022777 | 609.0 | 1829.0 | 0.022372 | 0.022915 | 0.681227 | 0.013159 | 0.034685 | 0.005064 |
| 13 | (26.0, 28.0] | 1965 | 0.240712 | 0.018358 | 473.0 | 1492.0 | 0.017376 | 0.018693 | 0.657287 | 0.009082 | 0.023940 | 0.005064 |
| 14 | (28.0, 30.0] | 1734 | 0.233564 | 0.016200 | 405.0 | 1329.0 | 0.014878 | 0.016651 | 0.638444 | 0.007148 | 0.018843 | 0.005064 |
| 15 | (30.0, 32.0] | 1541 | 0.229721 | 0.014397 | 354.0 | 1187.0 | 0.013005 | 0.014872 | 0.628313 | 0.003843 | 0.010131 | 0.005064 |
| 16 | (32.0, 34.0] | 1349 | 0.235730 | 0.012603 | 318.0 | 1031.0 | 0.011682 | 0.012917 | 0.644154 | 0.006009 | 0.015841 | 0.005064 |
| 17 | (34.0, 36.0] | 1161 | 0.235142 | 0.010847 | 273.0 | 888.0 | 0.010029 | 0.011126 | 0.642604 | 0.000588 | 0.001550 | 0.005064 |
| 18 | (36.0, 38.0] | 919 | 0.250272 | 0.008586 | 230.0 | 689.0 | 0.008449 | 0.008632 | 0.682485 | 0.015130 | 0.039881 | 0.005064 |
| 19 | (38.0, 40.0] | 812 | 0.242611 | 0.007586 | 197.0 | 615.0 | 0.007237 | 0.007705 | 0.662291 | 0.007661 | 0.020194 | 0.005064 |
| 20 | (40.0, 42.0] | 751 | 0.222370 | 0.007016 | 167.0 | 584.0 | 0.006135 | 0.007317 | 0.608930 | 0.020241 | 0.053360 | 0.005064 |
| 21 | (42.0, 44.0] | 689 | 0.217707 | 0.006437 | 150.0 | 539.0 | 0.005510 | 0.006753 | 0.596629 | 0.004663 | 0.012301 | 0.005064 |
| 22 | (44.0, 46.0] | 576 | 0.218750 | 0.005381 | 126.0 | 450.0 | 0.004629 | 0.005638 | 0.599381 | 0.001043 | 0.002752 | 0.005064 |
| 23 | (46.0, 48.0] | 540 | 0.242593 | 0.005045 | 131.0 | 409.0 | 0.004812 | 0.005124 | 0.662242 | 0.023843 | 0.062862 | 0.005064 |
| 24 | (48.0, 50.0] | 494 | 0.238866 | 0.004615 | 118.0 | 376.0 | 0.004335 | 0.004711 | 0.652421 | 0.003726 | 0.009822 | 0.005064 |
| 25 | (50.0, 52.0] | 417 | 0.211031 | 0.003896 | 88.0 | 329.0 | 0.003233 | 0.004122 | 0.579011 | 0.027835 | 0.073410 | 0.005064 |
| 26 | (52.0, 54.0] | 428 | 0.224299 | 0.003999 | 96.0 | 332.0 | 0.003527 | 0.004160 | 0.614017 | 0.013268 | 0.035006 | 0.005064 |
| 27 | (54.0, 56.0] | 351 | 0.210826 | 0.003279 | 74.0 | 277.0 | 0.002718 | 0.003471 | 0.578470 | 0.013473 | 0.035547 | 0.005064 |
| 28 | (56.0, 58.0] | 334 | 0.251497 | 0.003120 | 84.0 | 250.0 | 0.003086 | 0.003132 | 0.685714 | 0.040671 | 0.107244 | 0.005064 |
| 29 | (58.0, 60.0] | 318 | 0.223270 | 0.002971 | 71.0 | 247.0 | 0.002608 | 0.003095 | 0.611304 | 0.028227 | 0.074410 | 0.005064 |
| 30 | (60.0, 62.0] | 311 | 0.254019 | 0.002906 | 79.0 | 232.0 | 0.002902 | 0.002907 | 0.692364 | 0.030749 | 0.081060 | 0.005064 |
| 31 | (62.0, 64.0] | 238 | 0.218487 | 0.002224 | 52.0 | 186.0 | 0.001910 | 0.002330 | 0.598688 | 0.035532 | 0.093676 | 0.005064 |
| 32 | (64.0, 66.0] | 292 | 0.222603 | 0.002728 | 65.0 | 227.0 | 0.002388 | 0.002844 | 0.609543 | 0.004115 | 0.010855 | 0.005064 |
| 33 | (66.0, 68.0] | 258 | 0.267442 | 0.002410 | 69.0 | 189.0 | 0.002535 | 0.002368 | 0.727768 | 0.044839 | 0.118224 | 0.005064 |
| 34 | (68.0, 70.0] | 214 | 0.275701 | 0.001999 | 59.0 | 155.0 | 0.002167 | 0.001942 | 0.749572 | 0.008259 | 0.021804 | 0.005064 |
| 35 | (70.0, 72.0] | 194 | 0.242268 | 0.001812 | 47.0 | 147.0 | 0.001727 | 0.001842 | 0.661387 | 0.033433 | 0.088185 | 0.005064 |
| 36 | (72.0, 74.0] | 186 | 0.209677 | 0.001738 | 39.0 | 147.0 | 0.001433 | 0.001842 | 0.575437 | 0.032591 | 0.085950 | 0.005064 |
| 37 | (74.0, 76.0] | 172 | 0.244186 | 0.001607 | 42.0 | 130.0 | 0.001543 | 0.001629 | 0.666443 | 0.034509 | 0.091006 | 0.005064 |
| 38 | (76.0, 78.0] | 185 | 0.205405 | 0.001728 | 38.0 | 147.0 | 0.001396 | 0.001842 | 0.564154 | 0.038781 | 0.102288 | 0.005064 |
| 39 | (78.0, 80.0] | 154 | 0.201299 | 0.001439 | 31.0 | 123.0 | 0.001139 | 0.001541 | 0.553303 | 0.004107 | 0.010851 | 0.005064 |
| 40 | (80.0, 82.0] | 174 | 0.235632 | 0.001626 | 41.0 | 133.0 | 0.001506 | 0.001666 | 0.643896 | 0.034333 | 0.090593 | 0.005064 |
| 41 | (82.0, 84.0] | 147 | 0.258503 | 0.001373 | 38.0 | 109.0 | 0.001396 | 0.001366 | 0.704188 | 0.022871 | 0.060292 | 0.005064 |
| 42 | (84.0, 86.0] | 127 | 0.236220 | 0.001187 | 30.0 | 97.0 | 0.001102 | 0.001215 | 0.645447 | 0.022283 | 0.058741 | 0.005064 |
| 43 | (86.0, 88.0] | 141 | 0.241135 | 0.001317 | 34.0 | 107.0 | 0.001249 | 0.001341 | 0.658400 | 0.004914 | 0.012953 | 0.005064 |
| 44 | (88.0, 90.0] | 128 | 0.210938 | 0.001196 | 27.0 | 101.0 | 0.000992 | 0.001265 | 0.578764 | 0.030197 | 0.079636 | 0.005064 |
| 45 | (90.0, 92.0] | 152 | 0.236842 | 0.001420 | 36.0 | 116.0 | 0.001323 | 0.001453 | 0.647085 | 0.025905 | 0.068322 | 0.005064 |
| 46 | (92.0, 94.0] | 122 | 0.204918 | 0.001140 | 25.0 | 97.0 | 0.000918 | 0.001215 | 0.562867 | 0.031924 | 0.084218 | 0.005064 |
| 47 | (94.0, 96.0] | 160 | 0.200000 | 0.001495 | 32.0 | 128.0 | 0.001176 | 0.001604 | 0.549870 | 0.004918 | 0.012997 | 0.005064 |
| 48 | (96.0, 98.0] | 163 | 0.251534 | 0.001523 | 41.0 | 122.0 | 0.001506 | 0.001529 | 0.685811 | 0.051534 | 0.135941 | 0.005064 |
| 49 | (98.0, 100.0] | 135 | 0.200000 | 0.001261 | 27.0 | 108.0 | 0.000992 | 0.001353 | 0.549870 | 0.051534 | 0.135941 | 0.005064 |
plot_by_woe(df_temp.iloc[5 : 50, : ], 90)
#plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
# We create the following categories: 'Missig','0-4', '4-10', '10-20', '20-40', '40-100', '> 100'.
# 'Missing' will be the reference category.
df_inputs_prepr['mths_since_rcnt_il:0-4'] = np.where((df_inputs_prepr['mths_since_rcnt_il'] <= 4), 1, 0)
df_inputs_prepr['mths_since_rcnt_il:4-10'] = np.where((df_inputs_prepr['mths_since_rcnt_il'] > 4) & (df_inputs_prepr['mths_since_rcnt_il'] <= 10), 1, 0)
df_inputs_prepr['mths_since_rcnt_il:10-20'] = np.where((df_inputs_prepr['mths_since_rcnt_il'] > 10) & (df_inputs_prepr['mths_since_rcnt_il'] <= 20), 1, 0)
df_inputs_prepr['mths_since_rcnt_il:20-40'] = np.where((df_inputs_prepr['mths_since_rcnt_il'] > 20) & (df_inputs_prepr['mths_since_rcnt_il'] <= 40), 1, 0)
df_inputs_prepr['mths_since_rcnt_il:40-100'] = np.where((df_inputs_prepr['mths_since_rcnt_il'] > 40) & (df_inputs_prepr['mths_since_rcnt_il'] <= 100), 1, 0)
df_inputs_prepr['mths_since_rcnt_il:>100'] = np.where((df_inputs_prepr['mths_since_rcnt_il'] > 100) & (df_inputs_prepr['mths_since_rcnt_il'] <= 700), 1, 0)
df_inputs_prepr['mths_since_rcnt_il:Missing'] = np.where((df_inputs_prepr['mths_since_rcnt_il'] == 999), 1, 0)
Variable: 'mths_since_recent_bc'¶
df_inputs_prepr['mths_since_recent_bc'].nunique()
388
# 'mths_since_rcnt_il'
# the categories of everyone with 'mths_since_rcnt_il' less or equal 100.
df_inputs_prepr_temp = df_inputs_prepr.loc[df_inputs_prepr['mths_since_recent_bc'] <= 200, : ]
# Here we do fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.
# df_inputs_prepr_temp
df_inputs_prepr_temp['mths_since_recent_bc_factor'] = pd.cut(df_inputs_prepr_temp['mths_since_recent_bc'], 50)
# Here we do fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.
df_temp = woe_ordered_continuous(df_inputs_prepr_temp, 'mths_since_recent_bc_factor', df_targets_prepr[df_inputs_prepr_temp.index])
# We calculate weight of evidence.
df_temp
C:\Users\pc\AppData\Local\Temp\ipykernel_6728\3184696751.py:7: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()` df_inputs_prepr_temp['mths_since_recent_bc_factor'] = pd.cut(df_inputs_prepr_temp['mths_since_recent_bc'], 50) C:\Users\pc\AppData\Local\Temp\ipykernel_6728\3184696751.py:7: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy df_inputs_prepr_temp['mths_since_recent_bc_factor'] = pd.cut(df_inputs_prepr_temp['mths_since_recent_bc'], 50) C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1025164655.py:4: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning. df = pd.concat([df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].count(), C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1025164655.py:5: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning. df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].mean()], axis = 1)
| mths_since_recent_bc_factor | n_obs | prop_good | prop_n_obs | n_good | n_bad | prop_n_good | prop_n_bad | WoE | diff_prop_good | diff_WoE | IV | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | (-0.2, 4.0] | 49141 | 0.250300 | 0.179673 | 12300.0 | 36841.0 | 0.209647 | 0.171487 | 0.798642 | NaN | NaN | 0.012995 |
| 1 | (4.0, 8.0] | 42666 | 0.231425 | 0.155999 | 9874.0 | 32792.0 | 0.168297 | 0.152640 | 0.743163 | 0.018875 | 0.055480 | 0.012995 |
| 2 | (8.0, 12.0] | 33383 | 0.229967 | 0.122058 | 7677.0 | 25706.0 | 0.130851 | 0.119656 | 0.738863 | 0.001458 | 0.004300 | 0.012995 |
| 3 | (12.0, 16.0] | 38503 | 0.207542 | 0.140778 | 7991.0 | 30512.0 | 0.136202 | 0.142027 | 0.672428 | 0.022425 | 0.066435 | 0.012995 |
| 4 | (16.0, 20.0] | 19111 | 0.213594 | 0.069875 | 4082.0 | 15029.0 | 0.069576 | 0.069957 | 0.690418 | 0.006052 | 0.017989 | 0.012995 |
| 5 | (20.0, 24.0] | 15002 | 0.203906 | 0.054852 | 3059.0 | 11943.0 | 0.052139 | 0.055592 | 0.661596 | 0.009688 | 0.028821 | 0.012995 |
| 6 | (24.0, 28.0] | 11501 | 0.203113 | 0.042051 | 2336.0 | 9165.0 | 0.039816 | 0.042661 | 0.659231 | 0.000793 | 0.002366 | 0.012995 |
| 7 | (28.0, 32.0] | 8800 | 0.198068 | 0.032175 | 1743.0 | 7057.0 | 0.029709 | 0.032849 | 0.644167 | 0.005045 | 0.015064 | 0.012995 |
| 8 | (32.0, 36.0] | 7308 | 0.193487 | 0.026720 | 1414.0 | 5894.0 | 0.024101 | 0.027435 | 0.630452 | 0.004582 | 0.013714 | 0.012995 |
| 9 | (36.0, 40.0] | 5947 | 0.183454 | 0.021744 | 1091.0 | 4856.0 | 0.018596 | 0.022604 | 0.600306 | 0.010033 | 0.030147 | 0.012995 |
| 10 | (40.0, 44.0] | 4767 | 0.189218 | 0.017429 | 902.0 | 3865.0 | 0.015374 | 0.017991 | 0.617645 | 0.005764 | 0.017339 | 0.012995 |
| 11 | (44.0, 48.0] | 4078 | 0.172388 | 0.014910 | 703.0 | 3375.0 | 0.011982 | 0.015710 | 0.566857 | 0.016829 | 0.050787 | 0.012995 |
| 12 | (48.0, 52.0] | 3450 | 0.178551 | 0.012614 | 616.0 | 2834.0 | 0.010499 | 0.013192 | 0.585512 | 0.006162 | 0.018654 | 0.012995 |
| 13 | (52.0, 56.0] | 2966 | 0.196898 | 0.010845 | 584.0 | 2382.0 | 0.009954 | 0.011088 | 0.640667 | 0.018347 | 0.055156 | 0.012995 |
| 14 | (56.0, 60.0] | 2583 | 0.160279 | 0.009444 | 414.0 | 2169.0 | 0.007056 | 0.010096 | 0.529989 | 0.036619 | 0.110678 | 0.012995 |
| 15 | (60.0, 64.0] | 2329 | 0.165307 | 0.008515 | 385.0 | 1944.0 | 0.006562 | 0.009049 | 0.545333 | 0.005028 | 0.015344 | 0.012995 |
| 16 | (64.0, 68.0] | 2087 | 0.178246 | 0.007631 | 372.0 | 1715.0 | 0.006341 | 0.007983 | 0.584592 | 0.012939 | 0.039259 | 0.012995 |
| 17 | (68.0, 72.0] | 2055 | 0.152311 | 0.007514 | 313.0 | 1742.0 | 0.005335 | 0.008109 | 0.505569 | 0.025935 | 0.079022 | 0.012995 |
| 18 | (72.0, 76.0] | 1817 | 0.162356 | 0.006643 | 295.0 | 1522.0 | 0.005028 | 0.007085 | 0.536333 | 0.010044 | 0.030763 | 0.012995 |
| 19 | (76.0, 80.0] | 1640 | 0.160976 | 0.005996 | 264.0 | 1376.0 | 0.004500 | 0.006405 | 0.532119 | 0.001380 | 0.004214 | 0.012995 |
| 20 | (80.0, 84.0] | 1477 | 0.149628 | 0.005400 | 221.0 | 1256.0 | 0.003767 | 0.005846 | 0.497312 | 0.011348 | 0.034806 | 0.012995 |
| 21 | (84.0, 88.0] | 1471 | 0.148878 | 0.005378 | 219.0 | 1252.0 | 0.003733 | 0.005828 | 0.495004 | 0.000749 | 0.002308 | 0.012995 |
| 22 | (88.0, 92.0] | 1358 | 0.153903 | 0.004965 | 209.0 | 1149.0 | 0.003562 | 0.005348 | 0.510458 | 0.005024 | 0.015453 | 0.012995 |
| 23 | (92.0, 96.0] | 1129 | 0.169176 | 0.004128 | 191.0 | 938.0 | 0.003255 | 0.004366 | 0.557106 | 0.015273 | 0.046648 | 0.012995 |
| 24 | (96.0, 100.0] | 1045 | 0.143541 | 0.003821 | 150.0 | 895.0 | 0.002557 | 0.004166 | 0.478525 | 0.025636 | 0.078580 | 0.012995 |
| 25 | (100.0, 104.0] | 971 | 0.144181 | 0.003550 | 140.0 | 831.0 | 0.002386 | 0.003868 | 0.480506 | 0.000641 | 0.001981 | 0.012995 |
| 26 | (104.0, 108.0] | 891 | 0.173962 | 0.003258 | 155.0 | 736.0 | 0.002642 | 0.003426 | 0.571627 | 0.029781 | 0.091120 | 0.012995 |
| 27 | (108.0, 112.0] | 724 | 0.168508 | 0.002647 | 122.0 | 602.0 | 0.002079 | 0.002802 | 0.555075 | 0.005454 | 0.016552 | 0.012995 |
| 28 | (112.0, 116.0] | 688 | 0.164244 | 0.002516 | 113.0 | 575.0 | 0.001926 | 0.002677 | 0.542094 | 0.004264 | 0.012981 | 0.012995 |
| 29 | (116.0, 120.0] | 592 | 0.148649 | 0.002165 | 88.0 | 504.0 | 0.001500 | 0.002346 | 0.494297 | 0.015596 | 0.047797 | 0.012995 |
| 30 | (120.0, 124.0] | 503 | 0.143141 | 0.001839 | 72.0 | 431.0 | 0.001227 | 0.002006 | 0.477289 | 0.005507 | 0.017007 | 0.012995 |
| 31 | (124.0, 128.0] | 437 | 0.162471 | 0.001598 | 71.0 | 366.0 | 0.001210 | 0.001704 | 0.536686 | 0.019330 | 0.059397 | 0.012995 |
| 32 | (128.0, 132.0] | 368 | 0.157609 | 0.001346 | 58.0 | 310.0 | 0.000989 | 0.001443 | 0.521820 | 0.004863 | 0.014866 | 0.012995 |
| 33 | (132.0, 136.0] | 362 | 0.162983 | 0.001324 | 59.0 | 303.0 | 0.001006 | 0.001410 | 0.538249 | 0.005375 | 0.016428 | 0.012995 |
| 34 | (136.0, 140.0] | 276 | 0.163043 | 0.001009 | 45.0 | 231.0 | 0.000767 | 0.001075 | 0.538432 | 0.000060 | 0.000183 | 0.012995 |
| 35 | (140.0, 144.0] | 246 | 0.138211 | 0.000899 | 34.0 | 212.0 | 0.000580 | 0.000987 | 0.462005 | 0.024832 | 0.076427 | 0.012995 |
| 36 | (144.0, 148.0] | 233 | 0.145923 | 0.000852 | 34.0 | 199.0 | 0.000580 | 0.000926 | 0.485888 | 0.007711 | 0.023882 | 0.012995 |
| 37 | (148.0, 152.0] | 211 | 0.194313 | 0.000771 | 41.0 | 170.0 | 0.000699 | 0.000791 | 0.632928 | 0.048390 | 0.147040 | 0.012995 |
| 38 | (152.0, 156.0] | 190 | 0.221053 | 0.000695 | 42.0 | 148.0 | 0.000716 | 0.000689 | 0.712524 | 0.026740 | 0.079596 | 0.012995 |
| 39 | (156.0, 160.0] | 164 | 0.146341 | 0.000600 | 24.0 | 140.0 | 0.000409 | 0.000652 | 0.487180 | 0.074711 | 0.225344 | 0.012995 |
| 40 | (160.0, 164.0] | 167 | 0.215569 | 0.000611 | 36.0 | 131.0 | 0.000614 | 0.000610 | 0.696277 | 0.069227 | 0.209096 | 0.012995 |
| 41 | (164.0, 168.0] | 146 | 0.123288 | 0.000534 | 18.0 | 128.0 | 0.000307 | 0.000596 | 0.415367 | 0.092281 | 0.280910 | 0.012995 |
| 42 | (168.0, 172.0] | 130 | 0.192308 | 0.000475 | 25.0 | 105.0 | 0.000426 | 0.000489 | 0.626918 | 0.069020 | 0.211551 | 0.012995 |
| 43 | (172.0, 176.0] | 112 | 0.125000 | 0.000410 | 14.0 | 98.0 | 0.000239 | 0.000456 | 0.420748 | 0.067308 | 0.206171 | 0.012995 |
| 44 | (176.0, 180.0] | 110 | 0.145455 | 0.000402 | 16.0 | 94.0 | 0.000273 | 0.000438 | 0.484442 | 0.020455 | 0.063694 | 0.012995 |
| 45 | (180.0, 184.0] | 93 | 0.129032 | 0.000340 | 12.0 | 81.0 | 0.000205 | 0.000377 | 0.433388 | 0.016422 | 0.051054 | 0.012995 |
| 46 | (184.0, 188.0] | 67 | 0.164179 | 0.000245 | 11.0 | 56.0 | 0.000187 | 0.000261 | 0.541896 | 0.035147 | 0.108508 | 0.012995 |
| 47 | (188.0, 192.0] | 76 | 0.210526 | 0.000278 | 16.0 | 60.0 | 0.000273 | 0.000279 | 0.681304 | 0.046347 | 0.139409 | 0.012995 |
| 48 | (192.0, 196.0] | 76 | 0.092105 | 0.000278 | 7.0 | 69.0 | 0.000119 | 0.000321 | 0.315888 | 0.118421 | 0.365416 | 0.012995 |
| 49 | (196.0, 200.0] | 55 | 0.218182 | 0.000201 | 12.0 | 43.0 | 0.000205 | 0.000200 | 0.704023 | 0.126077 | 0.388135 | 0.012995 |
plot_by_woe(df_temp.iloc[10 : , : ], 90)
#plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
# We create the following categories: '0-12', '12-32', '32-52', '52-68', '68-100', '100-130', '> 130'.
# '> 130' will be the reference category.
df_inputs_prepr['mths_since_recent_bc:0-12'] = np.where((df_inputs_prepr['mths_since_recent_bc'] <= 12), 1, 0)
df_inputs_prepr['mths_since_recent_bc:12-32'] = np.where((df_inputs_prepr['mths_since_recent_bc'] > 12) & (df_inputs_prepr['mths_since_recent_bc'] <= 32), 1, 0)
df_inputs_prepr['mths_since_recent_bc:32-52'] = np.where((df_inputs_prepr['mths_since_recent_bc'] > 32) & (df_inputs_prepr['mths_since_recent_bc'] <= 52), 1, 0)
df_inputs_prepr['mths_since_recent_bc:52-68'] = np.where((df_inputs_prepr['mths_since_recent_bc'] > 52) & (df_inputs_prepr['mths_since_recent_bc'] <= 68), 1, 0)
df_inputs_prepr['mths_since_recent_bc:68-100'] = np.where((df_inputs_prepr['mths_since_recent_bc'] > 68) & (df_inputs_prepr['mths_since_recent_bc'] <= 100), 1, 0)
df_inputs_prepr['mths_since_recent_bc:100-130'] = np.where((df_inputs_prepr['mths_since_recent_bc'] > 100) & (df_inputs_prepr['mths_since_recent_bc'] <= 130), 1, 0)
df_inputs_prepr['mths_since_recent_bc:>130'] = np.where((df_inputs_prepr['mths_since_recent_bc'] > 130), 1, 0)
Variable: 'mths_since_recent_revol_delinq'¶
df_inputs_prepr['mths_since_recent_revol_delinq'].nunique()
139
df_inputs_prepr.loc[df_inputs_prepr['mths_since_recent_revol_delinq'] >= 999., : ]['mths_since_recent_revol_delinq'].count()
182457
There are 729465 missing value filled with '999'
# 'mths_since_recent_revol_delinq'
# the categories of everyone with 'mths_since_recent_revol_delinq' less or equal 84.
df_inputs_prepr_temp = df_inputs_prepr.loc[df_inputs_prepr['mths_since_recent_revol_delinq'] <= 120, : ]
# Here we do fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.
# df_inputs_prepr_temp
df_inputs_prepr_temp['mths_since_recent_revol_delinq_factor'] = pd.cut(df_inputs_prepr_temp['mths_since_recent_revol_delinq'], 50)
# Here we do fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.
df_temp = woe_ordered_continuous(df_inputs_prepr_temp, 'mths_since_recent_revol_delinq_factor', df_targets_prepr[df_inputs_prepr_temp.index])
# We calculate weight of evidence.
df_temp
C:\Users\pc\AppData\Local\Temp\ipykernel_6728\3078955673.py:7: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()` df_inputs_prepr_temp['mths_since_recent_revol_delinq_factor'] = pd.cut(df_inputs_prepr_temp['mths_since_recent_revol_delinq'], 50) C:\Users\pc\AppData\Local\Temp\ipykernel_6728\3078955673.py:7: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy df_inputs_prepr_temp['mths_since_recent_revol_delinq_factor'] = pd.cut(df_inputs_prepr_temp['mths_since_recent_revol_delinq'], 50) C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1025164655.py:4: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning. df = pd.concat([df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].count(), C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1025164655.py:5: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning. df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].mean()], axis = 1)
| mths_since_recent_revol_delinq_factor | n_obs | prop_good | prop_n_obs | n_good | n_bad | prop_n_good | prop_n_bad | WoE | diff_prop_good | diff_WoE | IV | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | (-0.12, 2.4] | 1611 | 0.218498 | 0.017557 | 352.0 | 1259.0 | 0.017197 | 0.017661 | 0.679916 | NaN | NaN | 0.001497 |
| 1 | (2.4, 4.8] | 2141 | 0.223260 | 0.023334 | 478.0 | 1663.0 | 0.023352 | 0.023328 | 0.693665 | 0.004762 | 0.013748 | 0.001497 |
| 2 | (4.8, 7.2] | 4412 | 0.241387 | 0.048084 | 1065.0 | 3347.0 | 0.052030 | 0.046951 | 0.745822 | 0.018127 | 0.052157 | 0.001497 |
| 3 | (7.2, 9.6] | 3107 | 0.228194 | 0.033862 | 709.0 | 2398.0 | 0.034638 | 0.033639 | 0.707888 | 0.013193 | 0.037934 | 0.001497 |
| 4 | (9.6, 12.0] | 4736 | 0.229941 | 0.051615 | 1089.0 | 3647.0 | 0.053202 | 0.051159 | 0.712918 | 0.001746 | 0.005029 | 0.001497 |
| 5 | (12.0, 14.4] | 3210 | 0.224611 | 0.034984 | 721.0 | 2489.0 | 0.035224 | 0.034915 | 0.697560 | 0.005330 | 0.015358 | 0.001497 |
| 6 | (14.4, 16.8] | 3146 | 0.223140 | 0.034287 | 702.0 | 2444.0 | 0.034296 | 0.034284 | 0.693319 | 0.001470 | 0.004240 | 0.001497 |
| 7 | (16.8, 19.2] | 4728 | 0.228003 | 0.051528 | 1078.0 | 3650.0 | 0.052665 | 0.051201 | 0.707338 | 0.004863 | 0.014018 | 0.001497 |
| 8 | (19.2, 21.6] | 2996 | 0.226302 | 0.032652 | 678.0 | 2318.0 | 0.033123 | 0.032516 | 0.702435 | 0.001702 | 0.004903 | 0.001497 |
| 9 | (21.6, 24.0] | 4266 | 0.230192 | 0.046493 | 982.0 | 3284.0 | 0.047975 | 0.046067 | 0.713641 | 0.003890 | 0.011206 | 0.001497 |
| 10 | (24.0, 26.4] | 3029 | 0.224166 | 0.033011 | 679.0 | 2350.0 | 0.033172 | 0.032965 | 0.696279 | 0.006026 | 0.017363 | 0.001497 |
| 11 | (26.4, 28.8] | 2947 | 0.227689 | 0.032118 | 671.0 | 2276.0 | 0.032781 | 0.031927 | 0.706433 | 0.003523 | 0.010154 | 0.001497 |
| 12 | (28.8, 31.2] | 4179 | 0.227327 | 0.045545 | 950.0 | 3229.0 | 0.046412 | 0.045296 | 0.705390 | 0.000362 | 0.001043 | 0.001497 |
| 13 | (31.2, 33.6] | 2728 | 0.215176 | 0.029731 | 587.0 | 2141.0 | 0.028678 | 0.030034 | 0.670313 | 0.012151 | 0.035076 | 0.001497 |
| 14 | (33.6, 36.0] | 3969 | 0.222978 | 0.043256 | 885.0 | 3084.0 | 0.043236 | 0.043262 | 0.692851 | 0.007802 | 0.022537 | 0.001497 |
| 15 | (36.0, 38.4] | 2587 | 0.222652 | 0.028194 | 576.0 | 2011.0 | 0.028140 | 0.028210 | 0.691909 | 0.000326 | 0.000942 | 0.001497 |
| 16 | (38.4, 40.8] | 2571 | 0.213536 | 0.028020 | 549.0 | 2022.0 | 0.026821 | 0.028364 | 0.665568 | 0.009116 | 0.026342 | 0.001497 |
| 17 | (40.8, 43.2] | 3754 | 0.216569 | 0.040913 | 813.0 | 2941.0 | 0.039719 | 0.041256 | 0.674342 | 0.003033 | 0.008774 | 0.001497 |
| 18 | (43.2, 45.6] | 2361 | 0.221093 | 0.025731 | 522.0 | 1839.0 | 0.025502 | 0.025797 | 0.687410 | 0.004524 | 0.013068 | 0.001497 |
| 19 | (45.6, 48.0] | 3641 | 0.213678 | 0.039681 | 778.0 | 2863.0 | 0.038009 | 0.040162 | 0.665978 | 0.007415 | 0.021432 | 0.001497 |
| 20 | (48.0, 50.4] | 1806 | 0.224252 | 0.019683 | 405.0 | 1401.0 | 0.019786 | 0.019653 | 0.696527 | 0.010575 | 0.030548 | 0.001497 |
| 21 | (50.4, 52.8] | 1400 | 0.242143 | 0.015258 | 339.0 | 1061.0 | 0.016562 | 0.014883 | 0.747991 | 0.017890 | 0.051464 | 0.001497 |
| 22 | (52.8, 55.2] | 2214 | 0.211834 | 0.024129 | 469.0 | 1745.0 | 0.022913 | 0.024479 | 0.660641 | 0.030309 | 0.087350 | 0.001497 |
| 23 | (55.2, 57.6] | 1582 | 0.201643 | 0.017241 | 319.0 | 1263.0 | 0.015585 | 0.017717 | 0.631076 | 0.010190 | 0.029565 | 0.001497 |
| 24 | (57.6, 60.0] | 2269 | 0.215954 | 0.024729 | 490.0 | 1779.0 | 0.023939 | 0.024955 | 0.672564 | 0.014311 | 0.041488 | 0.001497 |
| 25 | (60.0, 62.4] | 1490 | 0.228188 | 0.016239 | 340.0 | 1150.0 | 0.016610 | 0.016132 | 0.707869 | 0.012234 | 0.035305 | 0.001497 |
| 26 | (62.4, 64.8] | 1531 | 0.225996 | 0.016686 | 346.0 | 1185.0 | 0.016904 | 0.016623 | 0.701554 | 0.002192 | 0.006316 | 0.001497 |
| 27 | (64.8, 67.2] | 2426 | 0.220528 | 0.026440 | 535.0 | 1891.0 | 0.026137 | 0.026527 | 0.685779 | 0.005468 | 0.015775 | 0.001497 |
| 28 | (67.2, 69.6] | 1576 | 0.227157 | 0.017176 | 358.0 | 1218.0 | 0.017490 | 0.017086 | 0.704900 | 0.006630 | 0.019122 | 0.001497 |
| 29 | (69.6, 72.0] | 2337 | 0.216945 | 0.025470 | 507.0 | 1830.0 | 0.024769 | 0.025671 | 0.675428 | 0.010213 | 0.029472 | 0.001497 |
| 30 | (72.0, 74.4] | 1579 | 0.212160 | 0.017209 | 335.0 | 1244.0 | 0.016366 | 0.017451 | 0.661584 | 0.004785 | 0.013844 | 0.001497 |
| 31 | (74.4, 76.8] | 1577 | 0.224477 | 0.017187 | 354.0 | 1223.0 | 0.017294 | 0.017156 | 0.697174 | 0.012317 | 0.035589 | 0.001497 |
| 32 | (76.8, 79.2] | 1885 | 0.207427 | 0.020544 | 391.0 | 1494.0 | 0.019102 | 0.020958 | 0.647870 | 0.017050 | 0.049304 | 0.001497 |
| 33 | (79.2, 81.6] | 1143 | 0.200350 | 0.012457 | 229.0 | 914.0 | 0.011188 | 0.012821 | 0.627315 | 0.007077 | 0.020555 | 0.001497 |
| 34 | (81.6, 84.0] | 362 | 0.218232 | 0.003945 | 79.0 | 283.0 | 0.003859 | 0.003970 | 0.679148 | 0.017882 | 0.051834 | 0.001497 |
| 35 | (84.0, 86.4] | 85 | 0.223529 | 0.000926 | 19.0 | 66.0 | 0.000928 | 0.000926 | 0.694441 | 0.005297 | 0.015293 | 0.001497 |
| 36 | (86.4, 88.8] | 58 | 0.155172 | 0.000632 | 9.0 | 49.0 | 0.000440 | 0.000687 | 0.494499 | 0.068357 | 0.199943 | 0.001497 |
| 37 | (88.8, 91.2] | 74 | 0.243243 | 0.000806 | 18.0 | 56.0 | 0.000879 | 0.000786 | 0.751149 | 0.088071 | 0.256650 | 0.001497 |
| 38 | (91.2, 93.6] | 39 | 0.230769 | 0.000425 | 9.0 | 30.0 | 0.000440 | 0.000421 | 0.715302 | 0.012474 | 0.035847 | 0.001497 |
| 39 | (93.6, 96.0] | 46 | 0.195652 | 0.000501 | 9.0 | 37.0 | 0.000440 | 0.000519 | 0.613638 | 0.035117 | 0.101664 | 0.001497 |
| 40 | (96.0, 98.4] | 31 | 0.387097 | 0.000338 | 12.0 | 19.0 | 0.000586 | 0.000267 | 1.163022 | 0.191445 | 0.549384 | 0.001497 |
| 41 | (98.4, 100.8] | 24 | 0.291667 | 0.000262 | 7.0 | 17.0 | 0.000342 | 0.000238 | 0.889555 | 0.095430 | 0.273468 | 0.001497 |
| 42 | (100.8, 103.2] | 27 | 0.259259 | 0.000294 | 7.0 | 20.0 | 0.000342 | 0.000281 | 0.797029 | 0.032407 | 0.092526 | 0.001497 |
| 43 | (103.2, 105.6] | 13 | 0.153846 | 0.000142 | 2.0 | 11.0 | 0.000098 | 0.000154 | 0.490550 | 0.105413 | 0.306479 | 0.001497 |
| 44 | (105.6, 108.0] | 22 | 0.363636 | 0.000240 | 8.0 | 14.0 | 0.000391 | 0.000196 | 1.095308 | 0.209790 | 0.604758 | 0.001497 |
| 45 | (108.0, 110.4] | 9 | 0.222222 | 0.000098 | 2.0 | 7.0 | 0.000098 | 0.000098 | 0.690670 | 0.141414 | 0.404638 | 0.001497 |
| 46 | (110.4, 112.8] | 8 | 0.125000 | 0.000087 | 1.0 | 7.0 | 0.000049 | 0.000098 | 0.403814 | 0.097222 | 0.286856 | 0.001497 |
| 47 | (112.8, 115.2] | 11 | 0.272727 | 0.000120 | 3.0 | 8.0 | 0.000147 | 0.000112 | 0.835517 | 0.147727 | 0.431702 | 0.001497 |
| 48 | (115.2, 117.6] | 7 | 0.285714 | 0.000076 | 2.0 | 5.0 | 0.000098 | 0.000070 | 0.872578 | 0.012987 | 0.037061 | 0.001497 |
| 49 | (117.6, 120.0] | 6 | 0.166667 | 0.000065 | 1.0 | 5.0 | 0.000049 | 0.000070 | 0.528589 | 0.119048 | 0.343989 | 0.001497 |
plot_by_woe(df_temp.iloc[ : , : ], 90)
#plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
# We create the following categories: '0-20', '20-34', '34-50', '50-84', '>84', 'Missing'.
# '> 84' will be the reference category.
df_inputs_prepr['mths_since_recent_revol_delinq:0-20'] = np.where((df_inputs_prepr['mths_since_recent_revol_delinq'] <= 20), 1, 0)
df_inputs_prepr['mths_since_recent_revol_delinq:20-34'] = np.where((df_inputs_prepr['mths_since_recent_revol_delinq'] > 20) & (df_inputs_prepr['mths_since_recent_revol_delinq'] <= 34), 1, 0)
df_inputs_prepr['mths_since_recent_revol_delinq:34-50'] = np.where((df_inputs_prepr['mths_since_recent_revol_delinq'] > 34) & (df_inputs_prepr['mths_since_recent_revol_delinq'] <= 50), 1, 0)
df_inputs_prepr['mths_since_recent_revol_delinq:50-84'] = np.where((df_inputs_prepr['mths_since_recent_revol_delinq'] > 50) & (df_inputs_prepr['mths_since_recent_revol_delinq'] <= 84), 1, 0)
df_inputs_prepr['mths_since_recent_revol_delinq:>84'] = np.where((df_inputs_prepr['mths_since_recent_revol_delinq'] > 84) & (df_inputs_prepr['mths_since_recent_revol_delinq'] <= 800), 1, 0)
df_inputs_prepr['mths_since_recent_revol_delinq:Missing'] = np.where((df_inputs_prepr['mths_since_recent_revol_delinq'] == 999), 1, 0)
Variable: 'percent_bc_gt_75'¶
df_inputs_prepr['percent_bc_gt_75'].nunique()
183
# 'percent_bc_gt_75'
# the categories of everyone with 'percent_bc_gt_75' less or equal 100.
df_inputs_prepr_temp = df_inputs_prepr.loc[df_inputs_prepr['percent_bc_gt_75'] <= 300, : ]
# Here we do fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.
# df_inputs_prepr_temp
df_inputs_prepr_temp['percent_bc_gt_75_factor'] = pd.cut(df_inputs_prepr_temp['percent_bc_gt_75'], 25)
# Here we do fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.
df_temp = woe_ordered_continuous(df_inputs_prepr_temp, 'percent_bc_gt_75_factor', df_targets_prepr[df_inputs_prepr_temp.index])
# We calculate weight of evidence.
df_temp
C:\Users\pc\AppData\Local\Temp\ipykernel_6728\4160733302.py:7: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()` df_inputs_prepr_temp['percent_bc_gt_75_factor'] = pd.cut(df_inputs_prepr_temp['percent_bc_gt_75'], 25) C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1025164655.py:4: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning. df = pd.concat([df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].count(), C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1025164655.py:5: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning. df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].mean()], axis = 1)
| percent_bc_gt_75_factor | n_obs | prop_good | prop_n_obs | n_good | n_bad | prop_n_good | prop_n_bad | WoE | diff_prop_good | diff_WoE | IV | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | (-0.1, 4.0] | 63770 | 0.179991 | 0.232539 | 11478.0 | 52292.0 | 0.195197 | 0.242731 | 0.590102 | NaN | NaN | 0.013022 |
| 1 | (4.0, 8.0] | 651 | 0.239631 | 0.002374 | 156.0 | 495.0 | 0.002653 | 0.002298 | 0.767612 | 0.059641 | 0.177511 | 0.013022 |
| 2 | (8.0, 12.0] | 2922 | 0.216632 | 0.010655 | 633.0 | 2289.0 | 0.010765 | 0.010625 | 0.699703 | 0.022999 | 0.067909 | 0.013022 |
| 3 | (12.0, 16.0] | 5355 | 0.205602 | 0.019527 | 1101.0 | 4254.0 | 0.018724 | 0.019746 | 0.666915 | 0.011030 | 0.032788 | 0.013022 |
| 4 | (16.0, 20.0] | 13018 | 0.200031 | 0.047470 | 2604.0 | 10414.0 | 0.044284 | 0.048340 | 0.650290 | 0.005572 | 0.016624 | 0.013022 |
| 5 | (20.0, 24.0] | 1336 | 0.203593 | 0.004872 | 272.0 | 1064.0 | 0.004626 | 0.004939 | 0.660924 | 0.003562 | 0.010634 | 0.013022 |
| 6 | (24.0, 28.0] | 12199 | 0.201574 | 0.044484 | 2459.0 | 9740.0 | 0.041818 | 0.045211 | 0.654899 | 0.002019 | 0.006025 | 0.013022 |
| 7 | (28.0, 32.0] | 3266 | 0.216473 | 0.011910 | 707.0 | 2559.0 | 0.012023 | 0.011878 | 0.699230 | 0.014899 | 0.044330 | 0.013022 |
| 8 | (32.0, 36.0] | 18131 | 0.207600 | 0.066115 | 3764.0 | 14367.0 | 0.064011 | 0.066689 | 0.672866 | 0.008873 | 0.026364 | 0.013022 |
| 9 | (36.0, 40.0] | 21265 | 0.190877 | 0.077543 | 4059.0 | 17206.0 | 0.069028 | 0.079867 | 0.622878 | 0.016723 | 0.049988 | 0.013022 |
| 10 | (40.0, 44.0] | 2460 | 0.238211 | 0.008970 | 586.0 | 1874.0 | 0.009966 | 0.008699 | 0.763435 | 0.047334 | 0.140557 | 0.013022 |
| 11 | (44.0, 48.0] | 1076 | 0.266729 | 0.003924 | 287.0 | 789.0 | 0.004881 | 0.003662 | 0.847014 | 0.028517 | 0.083579 | 0.013022 |
| 12 | (48.0, 52.0] | 29228 | 0.217394 | 0.106581 | 6354.0 | 22874.0 | 0.108058 | 0.106177 | 0.701962 | 0.049334 | 0.145052 | 0.013022 |
| 13 | (52.0, 56.0] | 900 | 0.257778 | 0.003282 | 232.0 | 668.0 | 0.003945 | 0.003101 | 0.820844 | 0.040383 | 0.118882 | 0.013022 |
| 14 | (56.0, 60.0] | 8036 | 0.238676 | 0.029303 | 1918.0 | 6118.0 | 0.032618 | 0.028399 | 0.764802 | 0.019102 | 0.056042 | 0.013022 |
| 15 | (60.0, 64.0] | 1194 | 0.256281 | 0.004354 | 306.0 | 888.0 | 0.005204 | 0.004122 | 0.816464 | 0.017605 | 0.051662 | 0.013022 |
| 16 | (64.0, 68.0] | 17588 | 0.233057 | 0.064135 | 4099.0 | 13489.0 | 0.069709 | 0.062614 | 0.748256 | 0.023225 | 0.068209 | 0.013022 |
| 17 | (68.0, 72.0] | 1995 | 0.258145 | 0.007275 | 515.0 | 1480.0 | 0.008758 | 0.006870 | 0.821920 | 0.025089 | 0.073664 | 0.013022 |
| 18 | (72.0, 76.0] | 10247 | 0.238899 | 0.037366 | 2448.0 | 7799.0 | 0.041631 | 0.036202 | 0.765459 | 0.019246 | 0.056461 | 0.013022 |
| 19 | (76.0, 80.0] | 5941 | 0.267127 | 0.021664 | 1587.0 | 4354.0 | 0.026989 | 0.020211 | 0.848177 | 0.028228 | 0.082718 | 0.013022 |
| 20 | (80.0, 84.0] | 2908 | 0.256878 | 0.010604 | 747.0 | 2161.0 | 0.012704 | 0.010031 | 0.818209 | 0.010249 | 0.029967 | 0.013022 |
| 21 | (84.0, 88.0] | 2148 | 0.281192 | 0.007833 | 604.0 | 1544.0 | 0.010272 | 0.007167 | 0.889209 | 0.024314 | 0.070999 | 0.013022 |
| 22 | (88.0, 92.0] | 710 | 0.315493 | 0.002589 | 224.0 | 486.0 | 0.003809 | 0.002256 | 0.989025 | 0.034301 | 0.099817 | 0.013022 |
| 23 | (92.0, 96.0] | 70 | 0.328571 | 0.000255 | 23.0 | 47.0 | 0.000391 | 0.000218 | 1.027069 | 0.013078 | 0.038044 | 0.013022 |
| 24 | (96.0, 100.0] | 47820 | 0.243392 | 0.174377 | 11639.0 | 36181.0 | 0.197935 | 0.167946 | 0.778666 | 0.085180 | 0.248403 | 0.013022 |
plot_by_woe(df_temp.iloc[ : , : ], 90)
#plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
# We create the following categories: '0-4', '4-20', '20-40', '40-70', '70-96', '>96'.
# '> 84' will be the reference category.
df_inputs_prepr['percent_bc_gt_75:0-4'] = np.where((df_inputs_prepr['percent_bc_gt_75'] <= 4), 1, 0)
df_inputs_prepr['percent_bc_gt_75:4-20'] = np.where((df_inputs_prepr['percent_bc_gt_75'] > 4) & (df_inputs_prepr['percent_bc_gt_75'] <= 20), 1, 0)
df_inputs_prepr['percent_bc_gt_75:20-40'] = np.where((df_inputs_prepr['percent_bc_gt_75'] > 20) & (df_inputs_prepr['percent_bc_gt_75'] <= 40), 1, 0)
df_inputs_prepr['percent_bc_gt_75:40-70'] = np.where((df_inputs_prepr['percent_bc_gt_75'] > 40) & (df_inputs_prepr['percent_bc_gt_75'] <= 70), 1, 0)
df_inputs_prepr['percent_bc_gt_75:70-96'] = np.where((df_inputs_prepr['percent_bc_gt_75'] > 70) & (df_inputs_prepr['percent_bc_gt_75'] <= 96), 1, 0)
df_inputs_prepr['percent_bc_gt_75:>96'] = np.where((df_inputs_prepr['percent_bc_gt_75'] > 96), 1, 0)
Variable: 'pub_rec_bankruptcies'¶
df_inputs_prepr['pub_rec_bankruptcies'].unique()
array([ 0., 1., 2., 3., 6., 5., 4., 7., 8., 12.])
# 'mths_since_recent_inq'
df_temp = woe_ordered_continuous(df_inputs_prepr, 'pub_rec_bankruptcies', df_targets_prepr)
# We calculate weight of evidence.
df_temp
| pub_rec_bankruptcies | n_obs | prop_good | prop_n_obs | n_good | n_bad | prop_n_good | prop_n_bad | WoE | diff_prop_good | diff_WoE | IV | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 240047 | 0.211046 | 0.875336 | 50661.0 | 189386.0 | 0.861552 | 0.879099 | 0.683117 | NaN | NaN | 0.001597 |
| 1 | 1.0 | 32104 | 0.236170 | 0.117068 | 7582.0 | 24522.0 | 0.128941 | 0.113827 | 0.757427 | 0.025124 | 0.074310 | 0.001597 |
| 2 | 2.0 | 1611 | 0.271260 | 0.005875 | 437.0 | 1174.0 | 0.007432 | 0.005450 | 0.860245 | 0.035090 | 0.102818 | 0.001597 |
| 3 | 3.0 | 346 | 0.239884 | 0.001262 | 83.0 | 263.0 | 0.001412 | 0.001221 | 0.768357 | 0.031376 | 0.091888 | 0.001597 |
| 4 | 4.0 | 85 | 0.282353 | 0.000310 | 24.0 | 61.0 | 0.000408 | 0.000283 | 0.892592 | 0.042469 | 0.124235 | 0.001597 |
| 5 | 5.0 | 27 | 0.333333 | 0.000098 | 9.0 | 18.0 | 0.000153 | 0.000084 | 1.040928 | 0.050980 | 0.148336 | 0.001597 |
| 6 | 6.0 | 9 | 0.444444 | 0.000033 | 4.0 | 5.0 | 0.000068 | 0.000023 | 1.368881 | 0.111111 | 0.327953 | 0.001597 |
| 7 | 7.0 | 2 | 0.500000 | 0.000007 | 1.0 | 1.0 | 0.000017 | 0.000005 | 1.539806 | 0.055556 | 0.170925 | 0.001597 |
| 8 | 8.0 | 2 | 0.500000 | 0.000007 | 1.0 | 1.0 | 0.000017 | 0.000005 | 1.539806 | 0.000000 | 0.000000 | 0.001597 |
| 9 | 12.0 | 1 | 0.000000 | 0.000004 | 0.0 | 1.0 | 0.000000 | 0.000005 | 0.000000 | 0.500000 | 1.539806 | 0.001597 |
plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
# We create the following categories: '0', '1-3', '>=4'.
# '>=4' will be the reference category
df_inputs_prepr['pub_rec_bankruptcies:0'] = np.where(df_inputs_prepr['pub_rec_bankruptcies'].isin([0]), 1, 0)
df_inputs_prepr['pub_rec_bankruptcies:1-3'] = np.where(df_inputs_prepr['pub_rec_bankruptcies'].isin(range(1, 4)), 1, 0)
df_inputs_prepr['pub_rec_bankruptcies:>4'] = np.where(df_inputs_prepr['pub_rec_bankruptcies'].isin(range(4, 100)), 1, 0)
Variable: 'tot_coll_amt'¶
df_inputs_prepr['tot_coll_amt'].nunique()
6183
df_inputs_prepr.loc[df_inputs_prepr['tot_coll_amt'] == 0., : ]['tot_coll_amt'].count()
234254
There are 936642 rows with 0 value. A category will be considered for 0.
# 'tot_coll_amt'
# the categories of everyone with 'tot_coll_amt' less or equal 100.
df_inputs_prepr_temp = df_inputs_prepr.loc[(df_inputs_prepr['tot_coll_amt'] != 0) & (df_inputs_prepr['tot_coll_amt'] <= 1000), : ]
# Here we do fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.
# df_inputs_prepr_temp
df_inputs_prepr_temp['tot_coll_amt_factor'] = pd.cut(df_inputs_prepr_temp['tot_coll_amt'], 50)
# Here we do fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.
df_temp = woe_ordered_continuous(df_inputs_prepr_temp, 'tot_coll_amt_factor', df_targets_prepr[df_inputs_prepr_temp.index])
# We calculate weight of evidence.
df_temp
C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1959157564.py:7: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()` df_inputs_prepr_temp['tot_coll_amt_factor'] = pd.cut(df_inputs_prepr_temp['tot_coll_amt'], 50) C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1959157564.py:7: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy df_inputs_prepr_temp['tot_coll_amt_factor'] = pd.cut(df_inputs_prepr_temp['tot_coll_amt'], 50) C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1025164655.py:4: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning. df = pd.concat([df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].count(), C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1025164655.py:5: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning. df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].mean()], axis = 1)
| tot_coll_amt_factor | n_obs | prop_good | prop_n_obs | n_good | n_bad | prop_n_good | prop_n_bad | WoE | diff_prop_good | diff_WoE | IV | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | (3.004, 23.92] | 22 | 0.227273 | 0.000757 | 5.0 | 17.0 | 0.000711 | 0.000772 | 0.652650 | NaN | NaN | 0.005679 |
| 1 | (23.92, 43.84] | 198 | 0.318182 | 0.006817 | 63.0 | 135.0 | 0.008959 | 0.006133 | 0.900455 | 0.090909 | 0.247805 | 0.005679 |
| 2 | (43.84, 63.76] | 2083 | 0.265002 | 0.071721 | 552.0 | 1531.0 | 0.078498 | 0.069556 | 0.755446 | 0.053179 | 0.145009 | 0.005679 |
| 3 | (63.76, 83.68] | 2599 | 0.252790 | 0.089488 | 657.0 | 1942.0 | 0.093430 | 0.088229 | 0.722198 | 0.012213 | 0.033248 | 0.005679 |
| 4 | (83.68, 103.6] | 2298 | 0.247607 | 0.079124 | 569.0 | 1729.0 | 0.080916 | 0.078552 | 0.708084 | 0.005183 | 0.014114 | 0.005679 |
| 5 | (103.6, 123.52] | 1322 | 0.217852 | 0.045519 | 288.0 | 1034.0 | 0.040956 | 0.046977 | 0.626918 | 0.029755 | 0.081166 | 0.005679 |
| 6 | (123.52, 143.44] | 1267 | 0.238358 | 0.043625 | 302.0 | 965.0 | 0.042947 | 0.043842 | 0.682885 | 0.020507 | 0.055968 | 0.005679 |
| 7 | (143.44, 163.36] | 1405 | 0.239858 | 0.048377 | 337.0 | 1068.0 | 0.047924 | 0.048521 | 0.686972 | 0.001499 | 0.004086 | 0.005679 |
| 8 | (163.36, 183.28] | 1091 | 0.253896 | 0.037565 | 277.0 | 814.0 | 0.039391 | 0.036982 | 0.725209 | 0.014038 | 0.038237 | 0.005679 |
| 9 | (183.28, 203.2] | 1138 | 0.222320 | 0.039183 | 253.0 | 885.0 | 0.035978 | 0.040207 | 0.639127 | 0.031576 | 0.086083 | 0.005679 |
| 10 | (203.2, 223.12] | 931 | 0.258861 | 0.032056 | 241.0 | 690.0 | 0.034272 | 0.031348 | 0.738729 | 0.036542 | 0.099603 | 0.005679 |
| 11 | (223.12, 243.04] | 822 | 0.229927 | 0.028303 | 189.0 | 633.0 | 0.026877 | 0.028758 | 0.659893 | 0.028934 | 0.078836 | 0.005679 |
| 12 | (243.04, 262.96] | 818 | 0.226161 | 0.028165 | 185.0 | 633.0 | 0.026308 | 0.028758 | 0.649616 | 0.003766 | 0.010277 | 0.005679 |
| 13 | (262.96, 282.88] | 753 | 0.232404 | 0.025927 | 175.0 | 578.0 | 0.024886 | 0.026260 | 0.666649 | 0.006242 | 0.017033 | 0.005679 |
| 14 | (282.88, 302.8] | 756 | 0.232804 | 0.026030 | 176.0 | 580.0 | 0.025028 | 0.026350 | 0.667742 | 0.000401 | 0.001092 | 0.005679 |
| 15 | (302.8, 322.72] | 651 | 0.239631 | 0.022415 | 156.0 | 495.0 | 0.022184 | 0.022489 | 0.686355 | 0.006827 | 0.018613 | 0.005679 |
| 16 | (322.72, 342.64] | 589 | 0.275042 | 0.020280 | 162.0 | 427.0 | 0.023038 | 0.019399 | 0.782777 | 0.035411 | 0.096422 | 0.005679 |
| 17 | (342.64, 362.56] | 611 | 0.229133 | 0.021038 | 140.0 | 471.0 | 0.019909 | 0.021398 | 0.657725 | 0.045910 | 0.125052 | 0.005679 |
| 18 | (362.56, 382.48] | 517 | 0.226306 | 0.017801 | 117.0 | 400.0 | 0.016638 | 0.018173 | 0.650010 | 0.002827 | 0.007715 | 0.005679 |
| 19 | (382.48, 402.4] | 554 | 0.187726 | 0.019075 | 104.0 | 450.0 | 0.014790 | 0.020444 | 0.544302 | 0.038580 | 0.105708 | 0.005679 |
| 20 | (402.4, 422.32] | 468 | 0.217949 | 0.016114 | 102.0 | 366.0 | 0.014505 | 0.016628 | 0.627183 | 0.030223 | 0.082881 | 0.005679 |
| 21 | (422.32, 442.24] | 452 | 0.247788 | 0.015563 | 112.0 | 340.0 | 0.015927 | 0.015447 | 0.708577 | 0.029839 | 0.081394 | 0.005679 |
| 22 | (442.24, 462.16] | 458 | 0.218341 | 0.015770 | 100.0 | 358.0 | 0.014221 | 0.016265 | 0.628254 | 0.029447 | 0.080323 | 0.005679 |
| 23 | (462.16, 482.08] | 406 | 0.266010 | 0.013979 | 108.0 | 298.0 | 0.015358 | 0.013539 | 0.758188 | 0.047669 | 0.129934 | 0.005679 |
| 24 | (482.08, 502.0] | 426 | 0.237089 | 0.014668 | 101.0 | 325.0 | 0.014363 | 0.014765 | 0.679426 | 0.028921 | 0.078762 | 0.005679 |
| 25 | (502.0, 521.92] | 351 | 0.230769 | 0.012086 | 81.0 | 270.0 | 0.011519 | 0.012267 | 0.662191 | 0.006320 | 0.017235 | 0.005679 |
| 26 | (521.92, 541.84] | 378 | 0.232804 | 0.013015 | 88.0 | 290.0 | 0.012514 | 0.013175 | 0.667742 | 0.002035 | 0.005551 | 0.005679 |
| 27 | (541.84, 561.76] | 357 | 0.277311 | 0.012292 | 99.0 | 258.0 | 0.014078 | 0.011721 | 0.788954 | 0.044507 | 0.121212 | 0.005679 |
| 28 | (561.76, 581.68] | 324 | 0.225309 | 0.011156 | 73.0 | 251.0 | 0.010381 | 0.011403 | 0.647288 | 0.052002 | 0.141665 | 0.005679 |
| 29 | (581.68, 601.6] | 353 | 0.271955 | 0.012154 | 96.0 | 257.0 | 0.013652 | 0.011676 | 0.774371 | 0.046646 | 0.127083 | 0.005679 |
| 30 | (601.6, 621.52] | 312 | 0.208333 | 0.010743 | 65.0 | 247.0 | 0.009243 | 0.011222 | 0.600876 | 0.063621 | 0.173495 | 0.005679 |
| 31 | (621.52, 641.44] | 280 | 0.242857 | 0.009641 | 68.0 | 212.0 | 0.009670 | 0.009632 | 0.695145 | 0.034524 | 0.094269 | 0.005679 |
| 32 | (641.44, 661.36] | 302 | 0.231788 | 0.010398 | 70.0 | 232.0 | 0.009954 | 0.010540 | 0.664970 | 0.011069 | 0.030175 | 0.005679 |
| 33 | (661.36, 681.28] | 251 | 0.231076 | 0.008642 | 58.0 | 193.0 | 0.008248 | 0.008768 | 0.663027 | 0.000712 | 0.001943 | 0.005679 |
| 34 | (681.28, 701.2] | 320 | 0.246875 | 0.011018 | 79.0 | 241.0 | 0.011234 | 0.010949 | 0.706091 | 0.015799 | 0.043064 | 0.005679 |
| 35 | (701.2, 721.12] | 277 | 0.231047 | 0.009538 | 64.0 | 213.0 | 0.009101 | 0.009677 | 0.662948 | 0.015828 | 0.043142 | 0.005679 |
| 36 | (721.12, 741.04] | 272 | 0.253676 | 0.009365 | 69.0 | 203.0 | 0.009812 | 0.009223 | 0.724613 | 0.022630 | 0.061665 | 0.005679 |
| 37 | (741.04, 760.96] | 239 | 0.271967 | 0.008229 | 65.0 | 174.0 | 0.009243 | 0.007905 | 0.774403 | 0.018290 | 0.049790 | 0.005679 |
| 38 | (760.96, 780.88] | 210 | 0.233333 | 0.007231 | 49.0 | 161.0 | 0.006968 | 0.007315 | 0.669185 | 0.038633 | 0.105218 | 0.005679 |
| 39 | (780.88, 800.8] | 241 | 0.248963 | 0.008298 | 60.0 | 181.0 | 0.008532 | 0.008223 | 0.711777 | 0.015629 | 0.042592 | 0.005679 |
| 40 | (800.8, 820.72] | 239 | 0.259414 | 0.008229 | 62.0 | 177.0 | 0.008817 | 0.008041 | 0.740234 | 0.010452 | 0.028457 | 0.005679 |
| 41 | (820.72, 840.64] | 215 | 0.283721 | 0.007403 | 61.0 | 154.0 | 0.008675 | 0.006997 | 0.806410 | 0.024307 | 0.066176 | 0.005679 |
| 42 | (840.64, 860.56] | 200 | 0.250000 | 0.006886 | 50.0 | 150.0 | 0.007110 | 0.006815 | 0.714602 | 0.033721 | 0.091808 | 0.005679 |
| 43 | (860.56, 880.48] | 219 | 0.251142 | 0.007541 | 55.0 | 164.0 | 0.007821 | 0.007451 | 0.717711 | 0.001142 | 0.003109 | 0.005679 |
| 44 | (880.48, 900.4] | 212 | 0.231132 | 0.007300 | 49.0 | 163.0 | 0.006968 | 0.007405 | 0.663181 | 0.020009 | 0.054530 | 0.005679 |
| 45 | (900.4, 920.32] | 191 | 0.235602 | 0.006576 | 45.0 | 146.0 | 0.006399 | 0.006633 | 0.675372 | 0.004470 | 0.012191 | 0.005679 |
| 46 | (920.32, 940.24] | 173 | 0.179191 | 0.005957 | 31.0 | 142.0 | 0.004408 | 0.006451 | 0.520778 | 0.056411 | 0.154594 | 0.005679 |
| 47 | (940.24, 960.16] | 166 | 0.234940 | 0.005716 | 39.0 | 127.0 | 0.005546 | 0.005770 | 0.673566 | 0.055749 | 0.152788 | 0.005679 |
| 48 | (960.16, 980.08] | 159 | 0.276730 | 0.005475 | 44.0 | 115.0 | 0.006257 | 0.005225 | 0.787371 | 0.041790 | 0.113805 | 0.005679 |
| 49 | (980.08, 1000.0] | 167 | 0.245509 | 0.005750 | 41.0 | 126.0 | 0.005830 | 0.005724 | 0.702370 | 0.031221 | 0.085001 | 0.005679 |
plot_by_woe(df_temp.iloc[0 : 50, : ], 90)
#plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
# We create the following categories: '0', '0-110', '110-300', '300-580', '580-1000', '>1000'.
# '> 84' will be the reference category.
df_inputs_prepr['tot_coll_amt:0'] = np.where((df_inputs_prepr['tot_coll_amt'] == 0), 1, 0)
df_inputs_prepr['tot_coll_amt:0-110'] = np.where((df_inputs_prepr['tot_coll_amt'] > 0) & (df_inputs_prepr['tot_coll_amt'] <= 110), 1, 0)
df_inputs_prepr['tot_coll_amt:110-300'] = np.where((df_inputs_prepr['tot_coll_amt'] > 110) & (df_inputs_prepr['tot_coll_amt'] <= 300), 1, 0)
df_inputs_prepr['tot_coll_amt:300-580'] = np.where((df_inputs_prepr['tot_coll_amt'] > 300) & (df_inputs_prepr['tot_coll_amt'] <= 580), 1, 0)
df_inputs_prepr['tot_coll_amt:580-1000'] = np.where((df_inputs_prepr['tot_coll_amt'] > 580) & (df_inputs_prepr['tot_coll_amt'] <= 1000), 1, 0)
df_inputs_prepr['tot_coll_amt:>1000'] = np.where((df_inputs_prepr['tot_coll_amt'] > 1000), 1, 0)
Variable: 'mort_acc'¶
df_inputs_prepr['mort_acc'].unique()
array([ 0., 2., 3., 6., 5., 4., 1., 7., 10., 9., 8., 14., 11.,
17., 16., 12., 13., 15., 21., 19., 24., 22., 20., 18., 29., 27.,
34., 28., 31., 23., 35., 25.])
# 'fico_range_high'
df_temp = woe_ordered_continuous(df_inputs_prepr, 'mort_acc', df_targets_prepr)
# We calculate weight of evidence.
df_temp
| mort_acc | n_obs | prop_good | prop_n_obs | n_good | n_bad | prop_n_good | prop_n_bad | WoE | diff_prop_good | diff_WoE | IV | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 116758 | 0.246330 | 0.425760 | 28761.0 | 87997.0 | 0.489116 | 0.408468 | 0.787294 | NaN | NaN | inf |
| 1 | 1.0 | 46100 | 0.219176 | 0.168105 | 10104.0 | 35996.0 | 0.171831 | 0.167088 | 0.707242 | 0.027154 | 0.080052 | inf |
| 2 | 2.0 | 38601 | 0.197016 | 0.140759 | 7605.0 | 30996.0 | 0.129332 | 0.143878 | 0.641275 | 0.022160 | 0.065967 | inf |
| 3 | 3.0 | 28153 | 0.182005 | 0.102661 | 5124.0 | 23029.0 | 0.087140 | 0.106897 | 0.596183 | 0.015010 | 0.045092 | inf |
| 4 | 4.0 | 19247 | 0.167662 | 0.070185 | 3227.0 | 16020.0 | 0.054879 | 0.074362 | 0.552733 | 0.014343 | 0.043450 | inf |
| 5 | 5.0 | 11714 | 0.161687 | 0.042715 | 1894.0 | 9820.0 | 0.032210 | 0.045583 | 0.534515 | 0.005976 | 0.018218 | inf |
| 6 | 6.0 | 6571 | 0.160554 | 0.023961 | 1055.0 | 5516.0 | 0.017942 | 0.025604 | 0.531053 | 0.001133 | 0.003462 | inf |
| 7 | 7.0 | 3503 | 0.151870 | 0.012774 | 532.0 | 2971.0 | 0.009047 | 0.013791 | 0.504426 | 0.008684 | 0.026627 | inf |
| 8 | 8.0 | 1656 | 0.141304 | 0.006039 | 234.0 | 1422.0 | 0.003979 | 0.006601 | 0.471805 | 0.010565 | 0.032621 | inf |
| 9 | 9.0 | 879 | 0.142207 | 0.003205 | 125.0 | 754.0 | 0.002126 | 0.003500 | 0.474602 | 0.000903 | 0.002797 | inf |
| 10 | 10.0 | 443 | 0.115124 | 0.001615 | 51.0 | 392.0 | 0.000867 | 0.001820 | 0.389778 | 0.027083 | 0.084824 | inf |
| 11 | 11.0 | 251 | 0.163347 | 0.000915 | 41.0 | 210.0 | 0.000697 | 0.000975 | 0.539583 | 0.048222 | 0.149805 | inf |
| 12 | 12.0 | 127 | 0.125984 | 0.000463 | 16.0 | 111.0 | 0.000272 | 0.000515 | 0.424024 | 0.037362 | 0.115558 | inf |
| 13 | 13.0 | 67 | 0.134328 | 0.000244 | 9.0 | 58.0 | 0.000153 | 0.000269 | 0.450122 | 0.008344 | 0.026097 | inf |
| 14 | 14.0 | 60 | 0.116667 | 0.000219 | 7.0 | 53.0 | 0.000119 | 0.000246 | 0.394662 | 0.017662 | 0.055459 | inf |
| 15 | 15.0 | 34 | 0.117647 | 0.000124 | 4.0 | 30.0 | 0.000068 | 0.000139 | 0.397763 | 0.000980 | 0.003101 | inf |
| 16 | 16.0 | 20 | 0.250000 | 0.000073 | 5.0 | 15.0 | 0.000085 | 0.000070 | 0.798060 | 0.132353 | 0.400297 | inf |
| 17 | 17.0 | 9 | 0.111111 | 0.000033 | 1.0 | 8.0 | 0.000017 | 0.000037 | 0.377039 | 0.138889 | 0.421022 | inf |
| 18 | 18.0 | 6 | 0.000000 | 0.000022 | 0.0 | 6.0 | 0.000000 | 0.000028 | 0.000000 | 0.111111 | 0.377039 | inf |
| 19 | 19.0 | 6 | 0.333333 | 0.000022 | 2.0 | 4.0 | 0.000034 | 0.000019 | 1.040928 | 0.333333 | 1.040928 | inf |
| 20 | 20.0 | 9 | 0.222222 | 0.000033 | 2.0 | 7.0 | 0.000034 | 0.000032 | 0.716262 | 0.111111 | 0.324666 | inf |
| 21 | 21.0 | 2 | 0.500000 | 0.000007 | 1.0 | 1.0 | 0.000017 | 0.000005 | 1.539806 | 0.277778 | 0.823544 | inf |
| 22 | 22.0 | 4 | 0.250000 | 0.000015 | 1.0 | 3.0 | 0.000017 | 0.000014 | 0.798060 | 0.250000 | 0.741746 | inf |
| 23 | 23.0 | 1 | 0.000000 | 0.000004 | 0.0 | 1.0 | 0.000000 | 0.000005 | 0.000000 | 0.250000 | 0.798060 | inf |
| 24 | 24.0 | 3 | 0.000000 | 0.000011 | 0.0 | 3.0 | 0.000000 | 0.000014 | 0.000000 | 0.000000 | 0.000000 | inf |
| 25 | 25.0 | 1 | 0.000000 | 0.000004 | 0.0 | 1.0 | 0.000000 | 0.000005 | 0.000000 | 0.000000 | 0.000000 | inf |
| 26 | 27.0 | 3 | 0.000000 | 0.000011 | 0.0 | 3.0 | 0.000000 | 0.000014 | 0.000000 | 0.000000 | 0.000000 | inf |
| 27 | 28.0 | 1 | 0.000000 | 0.000004 | 0.0 | 1.0 | 0.000000 | 0.000005 | 0.000000 | 0.000000 | 0.000000 | inf |
| 28 | 29.0 | 1 | 1.000000 | 0.000004 | 1.0 | 0.0 | 0.000017 | 0.000000 | inf | 1.000000 | inf | inf |
| 29 | 31.0 | 1 | 0.000000 | 0.000004 | 0.0 | 1.0 | 0.000000 | 0.000005 | 0.000000 | 1.000000 | inf | inf |
| 30 | 34.0 | 2 | 0.000000 | 0.000007 | 0.0 | 2.0 | 0.000000 | 0.000009 | 0.000000 | 0.000000 | 0.000000 | inf |
| 31 | 35.0 | 1 | 0.000000 | 0.000004 | 0.0 | 1.0 | 0.000000 | 0.000005 | 0.000000 | 0.000000 | 0.000000 | inf |
plot_by_woe(df_temp.iloc[6 : 50, : ], 90)
#plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
# We create the following categories: '0', '1', '2', '3-5', '6-12', '13-18', '>=19'.
# '>=19' will be the reference category
df_inputs_prepr['mort_acc:0'] = np.where(df_inputs_prepr['mort_acc'].isin([0]), 1, 0)
df_inputs_prepr['mort_acc:1'] = np.where(df_inputs_prepr['mort_acc'].isin([1]), 1, 0)
df_inputs_prepr['mort_acc:2'] = np.where(df_inputs_prepr['mort_acc'].isin([2]), 1, 0)
df_inputs_prepr['mort_acc:3-5'] = np.where(df_inputs_prepr['mort_acc'].isin(range(3, 6)), 1, 0)
df_inputs_prepr['mort_acc:6-12'] = np.where(df_inputs_prepr['mort_acc'].isin(range(6, 13)), 1, 0)
df_inputs_prepr['mort_acc:13-18'] = np.where(df_inputs_prepr['mort_acc'].isin(range(13, 19)), 1, 0)
df_inputs_prepr['mort_acc:>=19'] = np.where(df_inputs_prepr['mort_acc'].isin(range(19, 200)), 1, 0)
Variable: 'months_since_last_credit_pull'¶
df_inputs_prepr['months_since_last_credit_pull'].nunique()
128
# 'months_since_last_credit_pull'
# the categories of everyone with 'months_since_last_credit_pull' less or equal 100.
df_inputs_prepr_temp = df_inputs_prepr.loc[(df_inputs_prepr['months_since_last_credit_pull'] <= 1000), : ]
# Here we do fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.
# df_inputs_prepr_temp
df_inputs_prepr_temp['months_since_last_credit_pull_factor'] = pd.cut(df_inputs_prepr_temp['months_since_last_credit_pull'], 40)
# Here we do fine-classing: using the 'cut' method, we split the variable into 50 categories by its values.
df_temp = woe_ordered_continuous(df_inputs_prepr_temp, 'months_since_last_credit_pull_factor', df_targets_prepr[df_inputs_prepr_temp.index])
# We calculate weight of evidence.
df_temp
C:\Users\pc\AppData\Local\Temp\ipykernel_6728\790836033.py:7: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()` df_inputs_prepr_temp['months_since_last_credit_pull_factor'] = pd.cut(df_inputs_prepr_temp['months_since_last_credit_pull'], 40) C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1025164655.py:4: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning. df = pd.concat([df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].count(), C:\Users\pc\AppData\Local\Temp\ipykernel_6728\1025164655.py:5: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning. df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].mean()], axis = 1)
| months_since_last_credit_pull_factor | n_obs | prop_good | prop_n_obs | n_good | n_bad | prop_n_good | prop_n_bad | WoE | diff_prop_good | diff_WoE | IV | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | (21.857, 25.575] | 134899 | 0.166554 | 0.491912 | 22468.0 | 112431.0 | 0.382096 | 0.521886 | 0.549360 | NaN | NaN | 0.264175 |
| 1 | (25.575, 29.15] | 33208 | 0.188539 | 0.121094 | 6261.0 | 26947.0 | 0.106476 | 0.125084 | 0.615855 | 0.021985 | 0.066495 | 0.264175 |
| 2 | (29.15, 32.725] | 17636 | 0.229474 | 0.064310 | 4047.0 | 13589.0 | 0.068824 | 0.063078 | 0.737689 | 0.040935 | 0.121834 | 0.264175 |
| 3 | (32.725, 36.3] | 18135 | 0.207996 | 0.066130 | 3772.0 | 14363.0 | 0.064147 | 0.066671 | 0.674043 | 0.021478 | 0.063646 | 0.264175 |
| 4 | (36.3, 39.875] | 7034 | 0.263861 | 0.025650 | 1856.0 | 5178.0 | 0.031564 | 0.024035 | 0.838636 | 0.055866 | 0.164593 | 0.264175 |
| 5 | (39.875, 43.45] | 13869 | 0.299156 | 0.050574 | 4149.0 | 9720.0 | 0.070559 | 0.045119 | 0.941510 | 0.035295 | 0.102874 | 0.264175 |
| 6 | (43.45, 47.025] | 10270 | 0.289289 | 0.037450 | 2971.0 | 7299.0 | 0.050525 | 0.033881 | 0.912794 | 0.009867 | 0.028716 | 0.264175 |
| 7 | (47.025, 50.6] | 9154 | 0.495193 | 0.033380 | 4533.0 | 4621.0 | 0.077089 | 0.021450 | 1.524733 | 0.205904 | 0.611939 | 0.264175 |
| 8 | (50.6, 54.175] | 12921 | 0.627970 | 0.047117 | 8114.0 | 4807.0 | 0.137989 | 0.022313 | 1.971875 | 0.132777 | 0.447142 | 0.264175 |
| 9 | (54.175, 57.75] | 2569 | 0.027637 | 0.009368 | 71.0 | 2498.0 | 0.001207 | 0.011595 | 0.099059 | 0.600333 | 1.872816 | 0.264175 |
| 10 | (57.75, 61.325] | 3677 | 0.034267 | 0.013408 | 126.0 | 3551.0 | 0.002143 | 0.016483 | 0.122216 | 0.006630 | 0.023157 | 0.264175 |
| 11 | (61.325, 64.9] | 1809 | 0.030404 | 0.006597 | 55.0 | 1754.0 | 0.000935 | 0.008142 | 0.108748 | 0.003864 | 0.013468 | 0.264175 |
| 12 | (64.9, 68.475] | 2061 | 0.040272 | 0.007515 | 83.0 | 1978.0 | 0.001412 | 0.009182 | 0.143004 | 0.009868 | 0.034255 | 0.264175 |
| 13 | (68.475, 72.05] | 1657 | 0.036210 | 0.006042 | 60.0 | 1597.0 | 0.001020 | 0.007413 | 0.128961 | 0.004062 | 0.014042 | 0.264175 |
| 14 | (72.05, 75.625] | 890 | 0.052809 | 0.003245 | 47.0 | 843.0 | 0.000799 | 0.003913 | 0.185867 | 0.016599 | 0.056906 | 0.264175 |
| 15 | (75.625, 79.2] | 1020 | 0.048039 | 0.003719 | 49.0 | 971.0 | 0.000833 | 0.004507 | 0.169643 | 0.004770 | 0.016224 | 0.264175 |
| 16 | (79.2, 82.775] | 590 | 0.047458 | 0.002151 | 28.0 | 562.0 | 0.000476 | 0.002609 | 0.167658 | 0.000582 | 0.001985 | 0.264175 |
| 17 | (82.775, 86.35] | 626 | 0.046326 | 0.002283 | 29.0 | 597.0 | 0.000493 | 0.002771 | 0.163791 | 0.001132 | 0.003867 | 0.264175 |
| 18 | (86.35, 89.925] | 354 | 0.033898 | 0.001291 | 12.0 | 342.0 | 0.000204 | 0.001588 | 0.120934 | 0.012428 | 0.042857 | 0.264175 |
| 19 | (89.925, 93.5] | 404 | 0.024752 | 0.001473 | 10.0 | 394.0 | 0.000170 | 0.001829 | 0.088914 | 0.009146 | 0.032020 | 0.264175 |
| 20 | (93.5, 97.075] | 354 | 0.039548 | 0.001291 | 14.0 | 340.0 | 0.000238 | 0.001578 | 0.140507 | 0.014796 | 0.051593 | 0.264175 |
| 21 | (97.075, 100.65] | 181 | 0.038674 | 0.000660 | 7.0 | 174.0 | 0.000119 | 0.000808 | 0.137489 | 0.000874 | 0.003018 | 0.264175 |
| 22 | (100.65, 104.225] | 219 | 0.045662 | 0.000799 | 10.0 | 209.0 | 0.000170 | 0.000970 | 0.161520 | 0.006988 | 0.024031 | 0.264175 |
| 23 | (104.225, 107.8] | 106 | 0.018868 | 0.000387 | 2.0 | 104.0 | 0.000034 | 0.000483 | 0.068084 | 0.026794 | 0.093436 | 0.264175 |
| 24 | (107.8, 111.375] | 156 | 0.076923 | 0.000569 | 12.0 | 144.0 | 0.000204 | 0.000668 | 0.266438 | 0.058055 | 0.198354 | 0.264175 |
| 25 | (111.375, 114.95] | 60 | 0.066667 | 0.000219 | 4.0 | 56.0 | 0.000068 | 0.000260 | 0.232454 | 0.010256 | 0.033985 | 0.264175 |
| 26 | (114.95, 118.525] | 104 | 0.038462 | 0.000379 | 4.0 | 100.0 | 0.000068 | 0.000464 | 0.136755 | 0.028205 | 0.095698 | 0.264175 |
| 27 | (118.525, 122.1] | 96 | 0.010417 | 0.000350 | 1.0 | 95.0 | 0.000017 | 0.000441 | 0.037840 | 0.028045 | 0.098915 | 0.264175 |
| 28 | (122.1, 125.675] | 50 | 0.080000 | 0.000182 | 4.0 | 46.0 | 0.000068 | 0.000214 | 0.276556 | 0.069583 | 0.238716 | 0.264175 |
| 29 | (125.675, 129.25] | 38 | 0.000000 | 0.000139 | 0.0 | 38.0 | 0.000000 | 0.000176 | 0.000000 | 0.080000 | 0.276556 | 0.264175 |
| 30 | (129.25, 132.825] | 19 | 0.000000 | 0.000069 | 0.0 | 19.0 | 0.000000 | 0.000088 | 0.000000 | 0.000000 | 0.000000 | 0.264175 |
| 31 | (132.825, 136.4] | 25 | 0.040000 | 0.000091 | 1.0 | 24.0 | 0.000017 | 0.000111 | 0.142067 | 0.040000 | 0.142067 | 0.264175 |
| 32 | (136.4, 139.975] | 12 | 0.083333 | 0.000044 | 1.0 | 11.0 | 0.000017 | 0.000051 | 0.287479 | 0.043333 | 0.145412 | 0.264175 |
| 33 | (139.975, 143.55] | 12 | 0.000000 | 0.000044 | 0.0 | 12.0 | 0.000000 | 0.000056 | 0.000000 | 0.083333 | 0.287479 | 0.264175 |
| 34 | (143.55, 147.125] | 8 | 0.125000 | 0.000029 | 1.0 | 7.0 | 0.000017 | 0.000032 | 0.420934 | 0.125000 | 0.420934 | 0.264175 |
| 35 | (147.125, 150.7] | 2 | 0.000000 | 0.000007 | 0.0 | 2.0 | 0.000000 | 0.000009 | 0.000000 | 0.125000 | 0.420934 | 0.264175 |
| 36 | (150.7, 154.275] | 2 | 0.000000 | 0.000007 | 0.0 | 2.0 | 0.000000 | 0.000009 | 0.000000 | 0.000000 | 0.000000 | 0.264175 |
| 37 | (154.275, 157.85] | 0 | NaN | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.264175 |
| 38 | (157.85, 161.425] | 1 | 0.000000 | 0.000004 | 0.0 | 1.0 | 0.000000 | 0.000005 | 0.000000 | NaN | NaN | 0.264175 |
| 39 | (161.425, 165.0] | 6 | 0.000000 | 0.000022 | 0.0 | 6.0 | 0.000000 | 0.000028 | 0.000000 | 0.000000 | 0.000000 | 0.264175 |
plot_by_woe(df_temp.iloc[ 10: 50, : ], 90)
#plot_by_woe(df_temp, 90)
# We plot the weight of evidence values.
# We create the following categories: '<=30', '30-48', '48-55', '55-110', '>110'.
# '> 110' will be the reference category.
df_inputs_prepr['months_since_last_credit_pull:<=30'] = np.where((df_inputs_prepr['months_since_last_credit_pull'] <= 30), 1, 0)
df_inputs_prepr['months_since_last_credit_pull:30-48'] = np.where((df_inputs_prepr['months_since_last_credit_pull'] > 30) & (df_inputs_prepr['months_since_last_credit_pull'] <= 48), 1, 0)
df_inputs_prepr['months_since_last_credit_pull:48-55'] = np.where((df_inputs_prepr['months_since_last_credit_pull'] > 48) & (df_inputs_prepr['months_since_last_credit_pull'] <= 55), 1, 0)
df_inputs_prepr['months_since_last_credit_pull:55-110'] = np.where((df_inputs_prepr['months_since_last_credit_pull'] > 55) & (df_inputs_prepr['months_since_last_credit_pull'] <= 110), 1, 0)
df_inputs_prepr['months_since_last_credit_pull:>110'] = np.where((df_inputs_prepr['months_since_last_credit_pull'] > 110), 1, 0)
Variable: 'total_public_records'¶
df_inputs_prepr['total_public_records'].unique()
array([ 0., 8., 1., 2., 14., 4., 3., 9., 10., 6., 93., 5., 11.,
7., 12., 16., 18., 13., 19., 21., 49., 20., 25., 37., 17., 15.,
22., 79., 88., 24., 41., 31., 29., 23., 91.])
# 'fico_range_high'
df_temp = woe_ordered_continuous(df_inputs_prepr, 'total_public_records', df_targets_prepr)
# We calculate weight of evidence.
df_temp
| total_public_records | n_obs | prop_good | prop_n_obs | n_good | n_bad | prop_n_good | prop_n_bad | WoE | diff_prop_good | diff_WoE | IV | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 227853 | 0.209284 | 0.830871 | 47686.0 | 180167.0 | 0.810959 | 0.836306 | 0.677877 | NaN | NaN | inf |
| 1 | 1.0 | 34418 | 0.237347 | 0.125506 | 8169.0 | 26249.0 | 0.138924 | 0.121844 | 0.760891 | 0.028063 | 0.083014 | inf |
| 2 | 2.0 | 7028 | 0.250427 | 0.025628 | 1760.0 | 5268.0 | 0.029931 | 0.024453 | 0.799312 | 0.013080 | 0.038421 | inf |
| 3 | 3.0 | 1674 | 0.232378 | 0.006104 | 389.0 | 1285.0 | 0.006615 | 0.005965 | 0.746254 | 0.018049 | 0.053058 | inf |
| 4 | 4.0 | 1471 | 0.235894 | 0.005364 | 347.0 | 1124.0 | 0.005901 | 0.005217 | 0.756614 | 0.003516 | 0.010360 | inf |
| 5 | 5.0 | 472 | 0.233051 | 0.001721 | 110.0 | 362.0 | 0.001871 | 0.001680 | 0.748239 | 0.002843 | 0.008376 | inf |
| 6 | 6.0 | 546 | 0.260073 | 0.001991 | 142.0 | 404.0 | 0.002415 | 0.001875 | 0.827560 | 0.027022 | 0.079322 | inf |
| 7 | 7.0 | 157 | 0.299363 | 0.000573 | 47.0 | 110.0 | 0.000799 | 0.000511 | 0.942112 | 0.039290 | 0.114551 | inf |
| 8 | 8.0 | 215 | 0.218605 | 0.000784 | 47.0 | 168.0 | 0.000799 | 0.000780 | 0.705550 | 0.080758 | 0.236562 | inf |
| 9 | 9.0 | 81 | 0.197531 | 0.000295 | 16.0 | 65.0 | 0.000272 | 0.000302 | 0.642817 | 0.021074 | 0.062733 | inf |
| 10 | 10.0 | 112 | 0.232143 | 0.000408 | 26.0 | 86.0 | 0.000442 | 0.000399 | 0.745562 | 0.034612 | 0.102745 | inf |
| 11 | 11.0 | 27 | 0.259259 | 0.000098 | 7.0 | 20.0 | 0.000119 | 0.000093 | 0.825179 | 0.027116 | 0.079617 | inf |
| 12 | 12.0 | 69 | 0.289855 | 0.000252 | 20.0 | 49.0 | 0.000340 | 0.000227 | 0.914442 | 0.030596 | 0.089262 | inf |
| 13 | 13.0 | 16 | 0.125000 | 0.000058 | 2.0 | 14.0 | 0.000034 | 0.000065 | 0.420934 | 0.164855 | 0.493508 | inf |
| 14 | 14.0 | 20 | 0.500000 | 0.000073 | 10.0 | 10.0 | 0.000170 | 0.000046 | 1.539806 | 0.375000 | 1.118872 | inf |
| 15 | 15.0 | 10 | 0.100000 | 0.000036 | 1.0 | 9.0 | 0.000017 | 0.000042 | 0.341514 | 0.400000 | 1.198292 | inf |
| 16 | 16.0 | 13 | 0.307692 | 0.000047 | 4.0 | 9.0 | 0.000068 | 0.000042 | 0.966339 | 0.207692 | 0.624825 | inf |
| 17 | 17.0 | 2 | 0.500000 | 0.000007 | 1.0 | 1.0 | 0.000017 | 0.000005 | 1.539806 | 0.192308 | 0.573467 | inf |
| 18 | 18.0 | 8 | 0.250000 | 0.000029 | 2.0 | 6.0 | 0.000034 | 0.000028 | 0.798060 | 0.250000 | 0.741746 | inf |
| 19 | 19.0 | 6 | 0.166667 | 0.000022 | 1.0 | 5.0 | 0.000017 | 0.000023 | 0.549702 | 0.083333 | 0.248358 | inf |
| 20 | 20.0 | 12 | 0.416667 | 0.000044 | 5.0 | 7.0 | 0.000085 | 0.000032 | 1.285622 | 0.250000 | 0.735920 | inf |
| 21 | 21.0 | 3 | 0.000000 | 0.000011 | 0.0 | 3.0 | 0.000000 | 0.000014 | 0.000000 | 0.416667 | 1.285622 | inf |
| 22 | 22.0 | 5 | 0.200000 | 0.000018 | 1.0 | 4.0 | 0.000017 | 0.000019 | 0.650199 | 0.200000 | 0.650199 | inf |
| 23 | 23.0 | 1 | 1.000000 | 0.000004 | 1.0 | 0.0 | 0.000017 | 0.000000 | inf | 0.800000 | inf | inf |
| 24 | 24.0 | 4 | 0.500000 | 0.000015 | 2.0 | 2.0 | 0.000034 | 0.000009 | 1.539806 | 0.500000 | inf | inf |
| 25 | 25.0 | 2 | 0.000000 | 0.000007 | 0.0 | 2.0 | 0.000000 | 0.000009 | 0.000000 | 0.500000 | 1.539806 | inf |
| 26 | 29.0 | 1 | 1.000000 | 0.000004 | 1.0 | 0.0 | 0.000017 | 0.000000 | inf | 1.000000 | inf | inf |
| 27 | 31.0 | 1 | 1.000000 | 0.000004 | 1.0 | 0.0 | 0.000017 | 0.000000 | inf | 0.000000 | NaN | inf |
| 28 | 37.0 | 1 | 1.000000 | 0.000004 | 1.0 | 0.0 | 0.000017 | 0.000000 | inf | 0.000000 | NaN | inf |
| 29 | 41.0 | 1 | 0.000000 | 0.000004 | 0.0 | 1.0 | 0.000000 | 0.000005 | 0.000000 | 1.000000 | inf | inf |
| 30 | 49.0 | 1 | 1.000000 | 0.000004 | 1.0 | 0.0 | 0.000017 | 0.000000 | inf | 1.000000 | inf | inf |
| 31 | 79.0 | 1 | 1.000000 | 0.000004 | 1.0 | 0.0 | 0.000017 | 0.000000 | inf | 0.000000 | NaN | inf |
| 32 | 88.0 | 1 | 1.000000 | 0.000004 | 1.0 | 0.0 | 0.000017 | 0.000000 | inf | 0.000000 | NaN | inf |
| 33 | 91.0 | 1 | 0.000000 | 0.000004 | 0.0 | 1.0 | 0.000000 | 0.000005 | 0.000000 | 1.000000 | inf | inf |
| 34 | 93.0 | 1 | 0.000000 | 0.000004 | 0.0 | 1.0 | 0.000000 | 0.000005 | 0.000000 | 0.000000 | 0.000000 | inf |
plot_by_woe(df_temp.iloc[: 30, : ], 90)
# We plot the weight of evidence values.
# Categories
# '0', '1-3', '4-12', '>=13'
df_inputs_prepr['total_public_records:0'] = np.where((df_inputs_prepr['total_public_records'] == 0), 1, 0)
df_inputs_prepr['total_public_records:1-3'] = np.where((df_inputs_prepr['total_rev_hi_lim'] >= 1) & (df_inputs_prepr['total_rev_hi_lim'] <= 3), 1, 0)
df_inputs_prepr['total_public_records:4-12'] = np.where((df_inputs_prepr['total_rev_hi_lim'] >= 4) & (df_inputs_prepr['total_rev_hi_lim'] <= 12), 1, 0)
df_inputs_prepr['total_public_records:>=13'] = np.where((df_inputs_prepr['total_public_records'] >= 13), 1, 0)
D. Final list of features to consider in the credit risk model.¶
Final_list_features = ['grade:A', 'grade:B', 'grade:C', 'grade:D', 'grade:E', 'grade:F',
'grade:G', 'home_ownership:MORTGAGE', 'home_ownership:OWN',
'home_ownership:RENT_OTHER_NONE_ANY',
'addr_state:ND_NE_IA_NV_FL_HI_AL', 'addr_state:NM_VA',
'addr_state:OK_TN_MO_LA_MD_NC', 'addr_state:UT_KY_AZ_NJ',
'addr_state:AR_MI_PA_OH_MN', 'addr_state:RI_MA_DE_SD_IN',
'addr_state:GA_WA_OR', 'addr_state:WI_MT', 'addr_state:IL_CT',
'addr_state:KS_SC_CO_VT_AK_MS', 'addr_state:WV_NH_WY_DC_ME_ID',
'verification_status:Not Verified',
'verification_status:Source Verified',
'verification_status:Verified', 'purpose:debt_consolidation',
'purpose:credit_card', 'purpose:sm_b__mov__ren_en__house__medic',
'purpose:other__vacat__maj_purch', 'home_impr__educ__car__wed',
'initial_list_status:f', 'initial_list_status:w',
'application_type:Individual', 'application_type:Joint App',
'hardship_flag:N', 'hardship_flag:Y', 'disbursement_method:Cash',
'disbursement_method:DirectPay', 'debt_settlement_flag:N',
'debt_settlement_flag:Y', 'term:36', 'term:60', 'num_tl_120dpd_2m:0', 'num_tl_120dpd_2m:1',
'num_tl_120dpd_2m:2-6', 'num_tl_30dpd:0', 'num_tl_30dpd:1',
'num_tl_30dpd:2-4', 'delinq_record_risk_score:0',
'delinq_record_risk_score:1-2', 'delinq_record_risk_score:3-4',
'delinq_record_risk_score:5-7', 'log_annual_inc:<20K',
'annual_inc:20K-30K', 'annual_inc:30K-40K', 'annual_inc:40K-50K',
'annual_inc:50K-60K', 'annual_inc:60K-70K', 'annual_inc:70K-80K',
'annual_inc:80K-90K', 'annual_inc:90K-100K',
'annual_inc:100K-120K', 'annual_inc:120K-140K', 'annual_inc:>140K',
'loan_amnt:<2500', 'loan_amnt:2500-6500', 'loan_amnt:6500-9500',
'loan_amnt:9500-11000', 'loan_amnt:11000-17500',
'loan_amnt:17500-28500', 'loan_amnt:>=28500', 'int_rate:<=8',
'int_rate:8-12.5', 'int_rate:12.5-16.5', 'int_rate:16.5-20',
'int_rate:20-23.5', 'int_rate:>23.5', 'emp_length_int:0',
'emp_length_int:1', 'emp_length_int:2-4', 'emp_length_int:5-7',
'emp_length_int:8-9', 'emp_length_int:10', 'dti:<=10', 'dti:10-20',
'dti:20-30', 'dti:30-40', 'dti:>40', 'min_mths_since_delinquency',
'min_mths_since_delinquency:Missing',
'min_mths_since_delinquency:<=20',
'min_mths_since_delinquency:20-40',
'min_mths_since_delinquency:40-80',
'min_mths_since_delinquency:>80',
'mths_since_earliest_cr_line:<=120',
'mths_since_earliest_cr_line:121-200',
'mths_since_earliest_cr_line:201-260',
'mths_since_earliest_cr_line:261-320',
'mths_since_earliest_cr_line:321-400',
'mths_since_earliest_cr_line:401-600',
'mths_since_earliest_cr_line:>=601', 'delinq_2yrs:0',
'delinq_2yrs:1', 'delinq_2yrs:2-9', 'delinq_2yrs:>=10',
'inq_last_6mths:0', 'inq_last_6mths:1-2', 'inq_last_6mths:3-5',
'inq_last_6mths:>=6', 'collections_12_mths_ex_med:0',
'collections_12_mths_ex_med:1', 'collections_12_mths_ex_med:>=2',
'chargeoff_within_12_mths:0', 'chargeoff_within_12_mths:1',
'chargeoff_within_12_mths:>=2', 'total_acc:<=20',
'total_acc:21-56', 'total_acc:>=57', 'delinq_amnt:0',
'delinq_amnt:>=1', 'num_accts_ever_120_pd:0',
'num_accts_ever_120_pd:1-11', 'num_accts_ever_120_pd:>=12',
'num_tl_90g_dpd_24m:0', 'num_tl_90g_dpd_24m:1-4',
'num_tl_90g_dpd_24m:>=5', 'revol_bal:<=8k', 'revol_bal:8-22k',
'revol_bal:22-35k', 'revol_bal:35-60k', 'revol_bal:60-100k',
'revol_bal:>100k', 'total_bal_il:=0', 'total_bal_il:0-18k',
'total_bal_il:18-30k', 'total_bal_il:30-70k',
'total_bal_il:70-200k', 'total_bal_il:>200k', 'max_bal_bc:=0',
'max_bal_bc:0-8k', 'max_bal_bc:8-16k', 'max_bal_bc:16-26k',
'max_bal_bc:26-50k', 'max_bal_bc:>50k', 'avg_cur_bal:0-7k',
'avg_cur_bal:7-15k', 'avg_cur_bal:15-30k', 'avg_cur_bal:30-50k',
'avg_cur_bal:50-100k', 'avg_cur_bal:>100k', 'bc_open_to_buy:0-5k',
'bc_open_to_buy:5-15k', 'bc_open_to_buy:15-30k',
'bc_open_to_buy:30-50k', 'bc_open_to_buy:50-100k',
'bc_open_to_buy:>100k', 'revol_bal_to_bc_limit:0-0.6',
'revol_bal_to_bc_limit:0.6-1.2', 'revol_bal_to_bc_limit:1.2-3.6',
'revol_bal_to_bc_limit:3.6-5.5', 'revol_bal_to_bc_limit:5.5-10.',
'revol_bal_to_bc_limit:>10.', 'revol_bal_to_open_to_buy:0-2',
'revol_bal_to_open_to_buy:2-4', 'revol_bal_to_open_to_buy:4-20',
'revol_bal_to_open_to_buy:20-100', 'revol_bal_to_open_to_buy:>100',
'total_bal_ex_mort_to_inc:0-0.4', 'total_bal_ex_mort_to_inc:0.4-1',
'total_bal_ex_mort_to_inc:1-2.6', 'total_bal_ex_mort_to_inc:2.6-4.4',
'total_bal_ex_mort_to_inc:>4.4', 'total_balance_to_credit_ratio:0-0.05',
'total_balance_to_credit_ratio:0.05-0.2', 'total_balance_to_credit_ratio:0.2-0.4',
'total_balance_to_credit_ratio:0.4-0.7', 'total_balance_to_credit_ratio:0.7-1',
'total_balance_to_credit_ratio:1-1.4', 'total_balance_to_credit_ratio:>1.4',
'rev_to_il_limit_ratio:0-0.6', 'rev_to_il_limit_ratio:0.6-0.8',
'rev_to_il_limit_ratio:0.8-1.8', 'rev_to_il_limit_ratio:1.8-4.5',
'rev_to_il_limit_ratio:4.5-10', 'rev_to_il_limit_ratio:>10.',
'total_il_high_credit_limit:0-5k', 'total_il_high_credit_limit:5-10k',
'total_il_high_credit_limit:10-30k', 'total_il_high_credit_limit:30-35k',
'total_il_high_credit_limit:35-100k', 'total_il_high_credit_limit:>100k',
'tot_cur_bal:0-20k', 'tot_cur_bal:20-70k', 'tot_cur_bal:70-80k',
'tot_cur_bal:80-130k', 'tot_cur_bal:130-200k', 'tot_cur_bal:200-250k',
'tot_cur_bal:250-500k', 'tot_cur_bal:>500k', 'open_act_il:0',
'open_act_il:1-5', 'open_act_il:6-15', 'open_act_il:>=16',
'open_il_12m:0', 'open_il_12m:1-5', 'open_il_12m:>=6',
'num_actv_rev_tl:0', 'num_actv_rev_tl:1-5', 'num_actv_rev_tl:6-9',
'num_actv_rev_tl:10-13', 'num_actv_rev_tl:14-17',
'num_actv_rev_tl:18-26', 'num_actv_rev_tl:>=27', 'open_rv_12m:0',
'open_rv_12m:1-2', 'open_rv_12m:3-5', 'open_rv_12m:6-8',
'open_rv_12m:9-13', 'open_rv_12m:>=14', 'num_bc_tl:0',
'num_bc_tl:1-5', 'num_bc_tl:6-10', 'num_bc_tl:11-20',
'num_bc_tl:21-32', 'num_bc_tl:>=33', 'open_acc_6m:0',
'open_acc_6m:1-3', 'open_acc_6m:4-7', 'open_acc_6m:>=8',
'acc_open_past_24mths:0-3', 'acc_open_past_24mths:4-7',
'acc_open_past_24mths:8-13', 'acc_open_past_24mths:14-21',
'acc_open_past_24mths:>=22', 'total_cu_tl:0', 'total_cu_tl:1-7',
'total_cu_tl:8-17', 'total_cu_tl:>=18', 'inq_last_12m:0',
'inq_last_12m:1-4', 'inq_last_12m:5-9', 'inq_last_12m:10-16',
'inq_last_12m:>=17', 'mths_since_recent_inq:Missing',
'mths_since_recent_inq:0-1', 'mths_since_recent_inq:2-3',
'mths_since_recent_inq:4-6', 'mths_since_recent_inq:7-10',
'mths_since_recent_inq:11-15', 'mths_since_recent_inq:>=16',
'out_prncp:=0', 'out_prncp:>0', 'last_pymnt_amnt:<=200',
'last_pymnt_amnt:200-700', 'last_pymnt_amnt:700-1000',
'last_pymnt_amnt:1000-1500', 'last_pymnt_amnt:1500-2600',
'last_pymnt_amnt:2600-10000', 'last_pymnt_amnt:>10000',
'principal_paid_ratio:<=0.3', 'principal_paid_ratio:0.3-0.45',
'principal_paid_ratio:0.45-0.6', 'principal_paid_ratio:0.6-1',
'principal_paid_ratio:=1', 'fico_range_high:<=680',
'fico_range_high:680-700', 'fico_range_high:700-720',
'fico_range_high:720-750', 'fico_range_high:750-795',
'fico_range_high:>795', 'last_fico_range_high:<=520',
'last_fico_range_high:520-550', 'last_fico_range_high:550-580',
'last_fico_range_high:580-610', 'last_fico_range_high:610-640',
'last_fico_range_high:640-670', 'last_fico_range_high:>670',
'mo_sin_rcnt_rev_tl_op:0-3', 'mo_sin_rcnt_rev_tl_op:3-6',
'mo_sin_rcnt_rev_tl_op:6-9', 'mo_sin_rcnt_rev_tl_op:9-20',
'mo_sin_rcnt_rev_tl_op:20-37', 'mo_sin_rcnt_rev_tl_op:37-63',
'mo_sin_rcnt_rev_tl_op:63-80', 'mo_sin_rcnt_rev_tl_op:80-140',
'mo_sin_rcnt_rev_tl_op:>140', 'mo_sin_rcnt_tl:0-2',
'mo_sin_rcnt_tl:2-5', 'mo_sin_rcnt_tl:5-6', 'mo_sin_rcnt_tl:6-10',
'mo_sin_rcnt_tl:10-15', 'mo_sin_rcnt_tl:15-20',
'mo_sin_rcnt_tl:20-50', 'mo_sin_rcnt_tl:>50',
'mths_since_rcnt_il:0-4', 'mths_since_rcnt_il:4-10',
'mths_since_rcnt_il:10-20', 'mths_since_rcnt_il:20-40',
'mths_since_rcnt_il:40-100', 'mths_since_rcnt_il:>100',
'mths_since_recent_bc:0-12', 'mths_since_recent_bc:12-32',
'mths_since_recent_bc:32-52', 'mths_since_recent_bc:52-68',
'mths_since_recent_bc:68-100', 'mths_since_recent_bc:100-130',
'mths_since_recent_bc:>130', 'mths_since_rcnt_il:Missing',
'mths_since_recent_revol_delinq:0-20', 'mths_since_recent_revol_delinq:20-34',
'mths_since_recent_revol_delinq:34-50', 'mths_since_recent_revol_delinq:50-84',
'mths_since_recent_revol_delinq:>84', 'mths_since_recent_revol_delinq:Missing',
'percent_bc_gt_75:0-4', 'percent_bc_gt_75:4-20', 'percent_bc_gt_75:20-40',
'percent_bc_gt_75:40-70', 'percent_bc_gt_75:70-96',
'percent_bc_gt_75:>96', 'pub_rec_bankruptcies:0',
'pub_rec_bankruptcies:1-3', 'pub_rec_bankruptcies:>4',
'tot_coll_amt:0', 'tot_coll_amt:0-110', 'tot_coll_amt:110-300',
'tot_coll_amt:300-580', 'tot_coll_amt:580-1000',
'tot_coll_amt:>1000', 'mort_acc:0', 'mort_acc:1', 'mort_acc:2',
'mort_acc:3-5', 'mort_acc:6-12', 'mort_acc:13-18', 'mort_acc:>=19',
'months_since_last_credit_pull:<=30', 'months_since_last_credit_pull:30-48',
'months_since_last_credit_pull:48-55', 'months_since_last_credit_pull:55-110',
'months_since_last_credit_pull:>110', 'total_public_records:0',
'total_public_records:1-3', 'total_public_records:4-12',
'total_public_records:>=13']
len(Final_list_features)
344
Feature Selection Process:¶
The final set of features selected for training the credit risk model consists of 344 variables. This refined list was obtained after a systematic feature selection process that involved removing variables with excessive missing values, eliminating features exhibiting high multicollinearity based on VIF analysis, and transforming relevant continuous variables into categorical formats using Weight of Evidence (WoE) encoding guided by both predictive power and variable distribution. This approach ensures that the retained features contribute meaningful, non-redundant information to the model while supporting better interpretability and generalization.
The following features are categorized and will not be used in the train and test of the model. They will be dropped from the dataframe file.¶
features_to_drop = ['loan_amnt', 'int_rate', 'grade', 'home_ownership', 'annual_inc',
'verification_status', 'purpose', 'addr_state', 'dti',
'delinq_2yrs', 'fico_range_high', 'inq_last_6mths', 'revol_bal',
'total_acc', 'initial_list_status', 'out_prncp', 'last_pymnt_amnt',
'last_fico_range_high', 'collections_12_mths_ex_med',
'application_type', 'tot_coll_amt', 'tot_cur_bal', 'open_acc_6m',
'open_act_il', 'open_il_12m', 'mths_since_rcnt_il', 'total_bal_il',
'open_rv_12m', 'max_bal_bc', 'total_rev_hi_lim', 'inq_fi',
'total_cu_tl', 'inq_last_12m', 'acc_open_past_24mths',
'avg_cur_bal', 'bc_open_to_buy', 'chargeoff_within_12_mths',
'delinq_amnt', 'mo_sin_rcnt_rev_tl_op', 'mo_sin_rcnt_tl',
'mort_acc', 'mths_since_recent_bc', 'mths_since_recent_inq',
'mths_since_recent_revol_delinq', 'num_accts_ever_120_pd',
'num_actv_rev_tl', 'num_bc_tl', 'num_tl_120dpd_2m', 'num_tl_30dpd',
'num_tl_90g_dpd_24m', 'num_tl_op_past_12m', 'percent_bc_gt_75',
'pub_rec_bankruptcies', 'hardship_flag', 'disbursement_method',
'debt_settlement_flag', 'emp_length_int', 'term_int',
'mths_since_earliest_cr_line', 'months_since_last_credit_pull',
'delinq_record_risk_score', 'revol_bal_to_bc_limit', 'revol_bal_to_open_to_buy',
'total_bal_ex_mort_to_inc', 'total_balance_to_credit_ratio', 'rev_to_il_limit_ratio',
'principal_paid_ratio', 'total_public_records', 'total_il_high_credit_limit']
len(features_to_drop)
69
# Drop the original features.
df_inputs_prepr_copy = df_inputs_prepr.copy()
df_inputs_prepr = df_inputs_prepr.drop(columns = features_to_drop)
df_inputs_prepr.shape[1]
344
E. Save the train and test datasets of the processed data in csv files.¶
#####
# loan_data_inputs_train = df_inputs_prepr.copy()
#####
loan_data_inputs_test = df_inputs_prepr.copy()
# save and export datasets in csv format.
loan_data_inputs_train.to_csv('C:/Disc D/365DataScience/Credit risk modeling/loan_data_inputs_train.csv')
loan_data_targets_train.to_csv('C:/Disc D/365DataScience/Credit risk modeling/loan_data_targets_train.csv')
loan_data_inputs_test.to_csv('C:/Disc D/365DataScience/Credit risk modeling/loan_data_inputs_test.csv')
loan_data_targets_test.to_csv('C:/Disc D/365DataScience/Credit risk modeling/loan_data_targets_test.csv')
# shape of train dataset.
loan_data_inputs_train.shape
(1096932, 344)
# shape of test dataset.
loan_data_inputs_test.shape
(274234, 344)
Conclusion:¶
After completing the data cleaning and preparation steps, the dataset was refined and structured for modeling. This included filling or removing missing values, eliminating features with excessively high proportions of missing data, and constructing a reliable target variable.
The dataset was then split into training and testing sets based on temporal criteria to simulate real-world prediction scenarios. Categorical and continuous variables were processed using Weight of Evidence (WoE) encoding and multicollinearity analysis (VIF), ensuring interpretability and statistical soundness.
Feature engineering techniques were applied to create new informative variables and enhance the predictive power of the dataset.
The resulting processed datasets were exported and saved in CSV format for use in the subsequent stages of credit risk modeling.